Age | Commit message (Collapse) | Author |
|
commit 9c145c56d0c8a0b62e48c8d71e055ad0fb2012ba upstream.
The stack guard page error case has long incorrectly caused a SIGBUS
rather than a SIGSEGV, but nobody actually noticed until commit
fee7e49d4514 ("mm: propagate error from stack expansion even for guard
page") because that error case was never actually triggered in any
normal situations.
Now that we actually report the error, people noticed the wrong signal
that resulted. So far, only the test suite of libsigsegv seems to have
actually cared, but there are real applications that use libsigsegv, so
let's not wait for any of those to break.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Adjust filenames, context
- Drop arc, metag, nios2 and lustre changes
- For sh, patch both 32-bit and 64-bit implementations to use goto bad_area
- For s390, pass int_code and trans_exc_code as arguments to do_no_context()
and do_sigsegv()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b800c91a0517071156e772d4fb329ad33590da62 upstream.
Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel
BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent
endless growth of anon_vma hierarchy")
Anon_vma_clone() is usually called for a copy of source vma in
destination argument. If source vma has anon_vma it should be already
in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's
called from anon_vma_fork(). In this case anon_vma_clone() finds
anon_vma for reusing.
Vma_adjust() calls it differently and this breaks anon_vma reusing
logic: anon_vma_clone() links vma to old anon_vma and updates degree
counters but vma_adjust() overrides vma->anon_vma right after that. As
a result final unlink_anon_vmas() decrements degree for wrong anon_vma.
This patch assigns ->anon_vma before calling anon_vma_clone().
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com>
Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com>
Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: vma_adjust() didn't use a variable to propagate
the error code from anon_vma_clone(); change that at the same time]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream.
Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
guard page") made sure that we return the error properly for stack
growth conditions. It also theorized that counting the guard page
towards the stack limit might break something, but also said "Let's see
if anybody notices".
Somebody did notice. Apparently android-x86 sets the stack limit very
close to the limit indeed, and including the guard page in the rlimit
check causes the android 'zygote' process problems.
So this adds the (fairly trivial) code to make the stack rlimit check be
against the actual real stack size, rather than the size of the vma that
includes the guard page.
Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org>
Cc: Jay Foad <jay.foad@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2d6d7f98284648c5ed113fe22a132148950b140f upstream.
Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:
__set_page_dirty_nobuffers() __delete_from_page_cache()
if (TestSetPageDirty(page))
page->mapping = NULL
if (PageDirty())
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
if (page->mapping)
account_page_dirtied(page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty bit,
in order to prevent a race where clear_page_dirty() misses the page bit
in the presence of dirty ptes. Upgrade that wait to a fully locked
set_page_dirty() to also cover the situation explained above.
Afterwards, the code in set_page_dirty() dealing with a truncation race
is no longer needed. Remove it.
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Adjust context
- Use VM_BUG_ON() rather than VM_BUG_ON_PAGE()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ed6d7c8e578331cad594ee70d60e2e146b5dce7b upstream.
There's only one caller of set_page_dirty_balance() and that will call it
with page_mkwrite == 0.
The page_mkwrite argument was unused since commit b827e496c893 "mm: close
page_mkwrite races".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc upstream.
Constantly forking task causes unlimited grow of anon_vma chain. Each
next child allocates new level of anon_vmas and links vma to all
previous levels because pages might be inherited from any level.
This patch adds heuristic which decides to reuse existing anon_vma
instead of forking new one. It adds counter anon_vma->degree which
counts linked vmas and directly descending anon_vmas and reuses anon_vma
if counter is lower than two. As a result each anon_vma has either vma
or at least two descending anon_vmas. In such trees half of nodes are
leafs with alive vmas, thus count of anon_vmas is no more than two times
bigger than count of vmas.
This heuristic reuses anon_vmas as few as possible because each reuse
adds false aliasing among vmas and rmap walker ought to scan more ptes
when it searches where page is might be mapped.
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
[akpm@linux-foundation.org: fix typo, per Rik]
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Tested-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Jerome Marchand <jmarchan@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.
Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.
This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.
And since /proc/maps explicitly ignores the guard page (commit
d7824370e263: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.
This is the minimal patch: it just propagates the error. It also
effectively makes the guard page part of the stack limit, which in turn
measn that the actual real stack is one page less than the stack limit.
Let's see if anybody notices. We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.
Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2022b4d18a491a578218ce7a4eca8666db895a73 upstream.
I've been seeing swapoff hangs in recent testing: it's cycling around
trying unsuccessfully to find an mm for some remaining pages of swap.
I have been exercising swap and page migration more heavily recently,
and now notice a long-standing error in copy_one_pte(): it's trying to
add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
so even when it's a migration entry or an hwpoison entry.
Which wouldn't matter much, except it adds dst_mm next to src_mm,
assuming src_mm is already on the mmlist: which may not be so. Then if
pages are later swapped out from dst_mm, swapoff won't be able to find
where to replace them.
There's already a !non_swap_entry() test for stats: move that up before
the swap_duplicate() and the addition to mmlist.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f55fefd1a5a339b1bd08c120b93312d6eb64a9fb upstream.
The WARN_ON checking whether i_mutex is held in
pagecache_isize_extended() was wrong because some filesystems (e.g.
XFS) use different locks for serialization of truncates / writes. So
just remove the check.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 6d50e60cd2edb5a57154db5a6f64eef5aa59b751 upstream.
If an anonymous mapping is not allowed to fault thp memory and then
madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
collapse this memory into thp memory.
This occurs because the madvise(2) handler for thp, hugepage_madvise(),
clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
until the final action of madvise_behavior(). This causes the
khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
the vma had previously had VM_NOHUGEPAGE set.
Fix this by passing the correct vma flags to the khugepaged mm slot
handler. There's no chance khugepaged can run on this vma until after
madvise_behavior() returns since we hold mm->mmap_sem.
It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
in hugepage_advise(), but I didn't want to introduce special case
behavior into madvise_behavior(). I think it's best to just let it
always set vma->vm_flags itself.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context, indentation]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 401507d67d5c2854f5a88b3f93f64fc6f267bca5 upstream.
Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup
allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but
corresponding kmemleak_free() is missing, which makes kmemleak be
wrongly disabled after memory offlining. Log is pasted at the end of
this commit message.
This patch add kmemleak_free() into free_page_cgroup(). During page
offlining, this patch removes corresponding entries in kmemleak rbtree.
After that, the freed memory can be allocated again by other subsystems
without killing kmemleak.
bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak
Offlined Pages 32768
kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x46/0x58
create_object+0x266/0x2c0
kmemleak_alloc+0x26/0x50
kmem_cache_alloc+0xd3/0x160
__sigqueue_alloc+0x49/0xd0
__send_signal+0xcb/0x410
send_signal+0x45/0x90
__group_send_sig_info+0x13/0x20
do_notify_parent+0x1bb/0x260
do_exit+0x767/0xa40
do_group_exit+0x44/0xa0
SyS_exit_group+0x17/0x20
system_call_fastpath+0x16/0x1b
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xffff880016900000 (size 524288):
kmemleak: comm "swapper/0", pid 0, jiffies 4294667296
kmemleak: min_count = 0
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
log_early+0x63/0x77
kmemleak_alloc+0x4b/0x50
init_section_page_cgroup+0x7f/0xf5
page_cgroup_init+0xc5/0xd0
start_kernel+0x333/0x408
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0xf5/0xfc
Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ce9ec37bddb633404a0c23e1acb181a264e7f7f2 upstream.
When unmapping a range of pages in zap_pte_range, the page being
unmapped is added to an mmu_gather_batch structure for asynchronous
freeing. If we run out of space in the batch structure before the range
has been completely unmapped, then we break out of the loop, force a
TLB flush and free the pages that we have batched so far. If there are
further pages to unmap, then we resume the loop where we left off.
Unfortunately, we forget to update addr when we break out of the loop,
which causes us to truncate the range being invalidated as the end
address is exclusive. When we re-enter the loop at the same address, the
page has already been freed and the pte_present test will fail, meaning
that we do not reconsider the address for invalidation.
This patch fixes the problem by incrementing addr by the PAGE_SIZE
before breaking out of the loop on batch failure.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context; add braces]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 90a8020278c1598fafd071736a0846b38510309c upstream.
->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than
silently discarding data later when writepage is called.
However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
mremap(map, 1024, 10000, 0);
map[4095] = 'a'; ----> no page_mkwrite() called
At the moment ->page_mkwrite() is called, filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095 but we
don't have block allocated for it.
This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[bwh: Backported to 3.2:
- Adjust context
- truncate_setsize() already has an oldsize variable]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit d3cb8bf6081b8b7a2dabb1264fe968fd870fa595 upstream.
A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.
[torvalds@linux-foundation.org: use maybe_mkwrite]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b928095b0a7cff7fb9fcf4c706348ceb8ab2c295 upstream.
If overwriting an empty directory with rename, then need to drop the extra
nlink.
Test prog:
#include <stdio.h>
#include <fcntl.h>
#include <err.h>
#include <sys/stat.h>
int main(void)
{
const char *test_dir1 = "test-dir1";
const char *test_dir2 = "test-dir2";
int res;
int fd;
struct stat statbuf;
res = mkdir(test_dir1, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir1);
res = mkdir(test_dir2, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir2);
fd = open(test_dir2, O_RDONLY);
if (fd == -1)
err(1, "open(\"%s\")", test_dir2);
res = rename(test_dir1, test_dir2);
if (res == -1)
err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
res = fstat(fd, &statbuf);
if (res == -1)
err(1, "fstat(%i)", fd);
if (statbuf.st_nlink != 0) {
fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
return 1;
}
return 0;
}
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 3189eddbcafcc4d827f7f19facbeddec4424eba8 upstream.
Currently, only SMP system free the percpu allocation info.
Uniprocessor system should free it too. For example, one x86 UML
virtual machine with 256MB memory, UML kernel wastes one page memory.
Signed-off-by: Honggang Li <enjoymindful@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 849f5169097e1ba35b90ac9df76b5bb6f9c0aabd upstream.
If pcpu_map_pages() fails midway, it unmaps the already mapped pages.
Currently, it doesn't flush tlb after the partial unmapping. This may
be okay in most cases as the established mapping hasn't been used at
that point but it can go wrong and when it goes wrong it'd be
extremely difficult to track down.
Flush tlb after the partial unmapping.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f0d279654dea22b7a6ad34b9334aee80cda62cde upstream.
When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to
free what has already been allocated. The invocation is across the
whole requested range and pcpu_free_pages() will try to free all
non-NULL pages; unfortunately, this is incorrect as
pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't
clear the pages array and thus the array may have entries from the
previous invocations making the partial failure path free incorrect
pages.
Fix it by open-coding the partial freeing of the already allocated
pages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit e7b691b085fda913830e5280ae6f724b2a63c824 upstream.
slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.
Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.
Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.
I believe the original mempolicy code did that in fact,
so it's likely a regression.
v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma <asharma@fb.com>
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Mackey <tdmackey@twitter.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b104a35d32025ca740539db2808aa3385d0f30eb upstream.
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 7f88f88f83ed609650a01b18572e605ea50cd163 upstream.
Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
blocks") had the side effect of making vmap_area.va_end member point to
the next vmap_area.va_start. This was creating an artificial reference
to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.
This patch marks the vmap_area containing pointers explicitly and
reduces the min ref_count to 2 as vm_struct still contains a reference
to the vmalloc'ed object. The kmemleak add_scan_area() function has
been improved to allow a SIZE_MAX argument covering the rest of the
object (for simpler calling sites).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 0253d634e0803a8376a0d88efee0bf523d8673f9 upstream.
Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry") changed the order of
huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
in some workloads like hugepage-backed heap allocation via libhugetlbfs.
This patch fixes it.
The test program for the problem is shown below:
$ cat heap.c
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#define HPS 0x200000
int main() {
int i;
char *p = malloc(HPS);
memset(p, '1', HPS);
for (i = 0; i < 5; i++) {
if (!fork()) {
memset(p, '2', HPS);
p = malloc(HPS);
memset(p, '3', HPS);
free(p);
return 0;
}
}
sleep(1);
free(p);
return 0;
}
$ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry"), so is applicable to -stable kernels which
include it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Suggested-by: Guillaume Morin <guillaume@morinfr.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.
shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted is
one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.
But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.
shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch). Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.
shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.
We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?
The origin of this part of the problem is my v3.1 commit d0823576bf4b
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c. It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.
Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.
Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.
But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.
Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
punched") was buggy: Sasha sent a lockdep report to remind us that
grabbing i_mutex in the fault path is a no-no (write syscall may already
hold i_mutex while faulting user buffer).
We tried a completely different approach (see following patch) but that
proved inadequate: good enough for a rational workload, but not good
enough against trinity - which forks off so many mappings of the object
that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
into serious starvation when concurrent faults force the puncher to fall
back to single-page unmap_mapping_range() searches of the i_mmap tree.
So return to the original umbrella approach, but keep away from i_mutex
this time. We really don't want to bloat every shmem inode with a new
mutex or completion, just to protect this unlikely case from trinity.
So extend the original with wait_queue_head on stack at the hole-punch
end, and wait_queue item on the stack at the fault end.
This involves further use of i_lock to guard against the races: lockdep
has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
i_lock around wake_up_bit(), which is comparable to what we do here.
i_lock is more convenient, but we could switch to shmem's info->lock.
This issue has been tagged with CVE-2014-4171, which will require commit
f00cdc6df7d7 and this and the following patch to be backported: we
suggest to 3.1+, though in fact the trinity forkbomb effect might go
back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
Anyone running trinity on 3.0 and earlier? I don't think we need care.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.
Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the reader chooses to stop; with the puncher's
hold on i_mutex locking out all other writers until it can complete.
It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.
Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.
In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
introduced vma merging to mbind(), but it should have also changed the
convention of passing start vma from queue_pages_range() (formerly
check_range()) to new_vma_page(): vma merging may have already freed
that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
worse crashes.
Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Adjust context
- Keep the same arguments to migrate_pages() except for private=start]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 082708072a4250f5c4dbc62065e7af93f5e45646 upstream.
Revert commit 0def08e3acc2 because check_range can't fail in
migrate_to_node with considering current usecases.
Quote from Johannes
: I think it makes sense to revert. Not because of the semantics, but I
: just don't see how check_range() could even fail for this callsite:
:
: 1. we pass mm->mmap->vm_start in there, so we should not fail due to
: find_vma()
:
: 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
: and so can not fail
:
: 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
: continue until addr == end, so we never fail with -EIO
And I added a new VM_BUG_ON for checking migrate_to_node's future usecase
which might pass to MPOL_MF_STRICT.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vasiliy Kulikov <segooon@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 4a705fef986231a3e7a6b1a6d3c37025f021f49f upstream.
There's a race between fork() and hugepage migration, as a result we try
to "dereference" a swap entry as a normal pte, causing kernel panic.
The cause of the problem is that copy_hugetlb_page_range() can't handle
"swap entry" family (migration entry and hwpoisoned entry) so let's fix
it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.
While working address sanitizer for kernel I've discovered
use-after-free bug in __put_anon_vma.
For the last anon_vma, anon_vma->root freed before child anon_vma.
Later in anon_vma_free(anon_vma) we are referencing to already freed
anon_vma->root to check rwsem.
This fixes it by freeing the child anon_vma before freeing
anon_vma->root.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.
When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim. Lockdep currently
warns about this:
On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
> {RECLAIM_FS-ON-W} state was registered at:
> mark_held_locks+0xb9/0x140
> lockdep_trace_alloc+0x7a/0xe0
> kmem_cache_alloc_trace+0x37/0x240
> flex_array_alloc+0x99/0x1a0
> cgroup_attach_task+0x63/0x430
> attach_task_by_pid+0x210/0x280
> cgroup_procs_write+0x16/0x20
> cgroup_file_write+0x120/0x2c0
> vfs_write+0xc0/0x1f0
> SyS_write+0x4c/0xa0
> tracesys+0xdd/0xe2
> irq event stamp: 49
> hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
> hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
> softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
> softirqs last disabled at (0): (null)
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&sig->group_rwsem);
> <Interrupt>
> lock(&sig->group_rwsem);
>
> *** DEADLOCK ***
>
> no locks held by kswapd2/1151.
>
> stack backtrace:
> CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> Call Trace:
> dump_stack+0x19/0x1b
> print_usage_bug+0x1f7/0x208
> mark_lock+0x21d/0x2a0
> __lock_acquire+0x52a/0xb60
> lock_acquire+0xa2/0x140
> down_read+0x51/0xa0
> exit_signals+0x24/0x130
> do_exit+0xb5/0xa50
> kthread+0xdb/0x100
> ret_from_fork+0x7c/0xb0
This is because the kswapd thread is still marked as a reclaimer at the
time of exit. But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point. Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.
Trinity reports BUG:
sleeping function called from invalid context at kernel/locking/rwsem.c:47
in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27
__might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..
Right, since conversion to mutex then rwsem, we should not put_anon_vma()
from inside an rcu_read_lock()ed section: fix the two places that did so.
And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.
Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 498c2280212327858e521e9d21345d4cc2637f54 upstream.
kmap_to_page returns the corresponding struct page for a virtual address
of an arbitrary mapping. This works by checking whether the address
falls in the pkmap region and using the pkmap page tables instead of the
linear mapping if appropriate.
Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
incorrectly treated as a highmem address and we can end up walking off
the end of pkmap_page_table and subsequently passing junk to pte_page.
This patch fixes the bound check to stay within the pkmap tables.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream.
When a memory error happens on an in-use page or (free and in-use)
hugepage, the victim page is isolated with its refcount set to one.
When you try to unpoison it later, unpoison_memory() calls put_page()
for it twice in order to bring the page back to free page pool (buddy or
free hugepage list). However, if another memory error occurs on the
page which we are unpoisoning, memory_failure() returns without
releasing the refcount which was incremented in the same call at first,
which results in memory leak and unconsistent num_poisoned_pages
statistics. This patch fixes it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: s/num_poisoned_pages/bad_mce_pages/]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
hugepage
commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream.
For handling a free hugepage in memory failure, the race will happen if
another thread hwpoisoned this hugepage concurrently. So we need to
check PageHWPoison instead of !PageHWPoison.
If hwpoison_filter(p) returns true or a race happens, then we need to
unlock_page(hpage).
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: s/num_poisoned_pages/bad_mce_pages/]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit d5c9fde3dae750889168807038243ff36431d276 upstream.
It is possible for "limit - setpoint + 1" to equal zero, after getting
truncated to a 32 bit variable, and resulting in a divide by zero error.
Using the fully 64 bit divide functions avoids this problem. It also
will cause pos_ratio_polynom() to return the correct value when
(setpoint - limit) exceeds 2^32.
Also uninline pos_ratio_polynom, at Andrew's request.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
Adjust context - pos_ratio_polynom() is not a separate function]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ed84825b785ceb932af7dd5aa08614801721320b upstream.
In bdi_position_ratio(), get difference (setpoint-dirty) right even when
negative. Both setpoint and dirty are unsigned long, the difference was
zero-padded thus wrongly sign-extended to s64. This issue affects all
32-bit architectures, does not affect 64-bit architectures where long
and s64 are equivalent.
In this function, dirty is between freerun and limit, the pseudo-float x
is between [-1,1], expected to be negative about half the time. With
zero-padding, instead of a small negative x we obtained a large positive
one so bdi_position_ratio() returned garbage.
Casting the difference to s64 also prevents overflow with left-shift;
though normally these numbers are small and I never observed a 32-bit
overflow there.
(This patch does not solve the PAE OOM issue.)
Paul Szabo psz@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney Australia
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Paul Szabo <psz@maths.usyd.edu.au>
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream.
fixup_user_fault() is used by the futex code when the direct user access
fails, and the futex code wants it to either map in the page in a usable
form or return an error. It relied on handle_mm_fault() to map the
page, and correctly checked the error return from that, but while that
does map the page, it doesn't actually guarantee that the page will be
mapped with sufficient permissions to be then accessed.
So do the appropriate tests of the vma access rights by hand.
[ Side note: arguably handle_mm_fault() could just do that itself, but
we have traditionally done it in the caller, because some callers -
notably get_user_pages() - have been able to access pages even when
they are mapped with PROT_NONE. Maybe we should re-visit that design
decision, but in the meantime this is the minimal patch. ]
Found by Dave Jones running his trinity tool.
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 7848a4bf51b34f41fcc9bd77e837126d99ae84e3 upstream.
soft lockup in freeing gigantic hugepage fixed in commit 55f67141a892 "mm:
hugetlb: fix softlockup when a large number of hugepages are freed." can
happen in return_unused_surplus_pages(), so let's fix it.
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 55f67141a8927b2be3e51840da37b8a2320143ed upstream.
When I decrease the value of nr_hugepage in procfs a lot, softlockup
happens. It is because there is no chance of context switch during this
process.
On the other hand, when I allocate a large number of hugepages, there is
some chance of context switch. Hence softlockup doesn't happen during
this process. So it's necessary to add the context switch in the
freeing process as same as allocating process to avoid softlockup.
When I freed 12 TB hugapages with kernel-2.6.32-358.el6, the freeing
process occupied a CPU over 150 seconds and following softlockup message
appeared twice or more.
$ echo 6000000 > /proc/sys/vm/nr_hugepages
$ cat /proc/sys/vm/nr_hugepages
6000000
$ grep ^Huge /proc/meminfo
HugePages_Total: 6000000
HugePages_Free: 6000000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
$ echo 0 > /proc/sys/vm/nr_hugepages
BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
free_pool_huge_page+0xb8/0xd0
set_max_huge_pages+0x128/0x190
hugetlb_sysctl_handler_common+0x113/0x140
hugetlb_sysctl_handler+0x1e/0x20
proc_sys_call_handler+0x97/0xd0
proc_sys_write+0x14/0x20
vfs_write+0xb8/0x1a0
sys_write+0x51/0x90
__audit_syscall_exit+0x265/0x290
system_call_fastpath+0x16/0x1b
I have not confirmed this problem with upstream kernels because I am not
able to prepare the machine equipped with 12TB memory now. However I
confirmed that the amount of decreasing hugepages was directly
proportional to the amount of required time.
I measured required times on a smaller machine. It showed 130-145
hugepages decreased in a millisecond.
Amount of decreasing Required time Decreasing rate
hugepages (msec) (pages/msec)
------------------------------------------------------------
10,000 pages == 20GB 70 - 74 135-142
30,000 pages == 60GB 208 - 229 131-144
It means decrement of 6TB hugepages will trigger softlockup with the
default threshold 20sec, in this decreasing rate.
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream.
A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).
The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag. This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless. This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
effect.
The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as we
already have check_page locked) and mlock_vma_page() is performed only
upon success. If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole unevictable
memory design.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream.
In __page_check_address(), if address's pud is not present,
huge_pte_offset() will return NULL, we should check the return value.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: qiuxishi <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
spin_lock_irq()
commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream.
During aio stress test, we observed the following lockdep warning. This
mean AIO+numa_balancing is currently deadlockable.
The problem is, aio_migratepage disable interrupt, but
__set_page_dirty_nobuffers unintentionally enable it again.
Generally, all helper function should use spin_lock_irqsave() instead of
spin_lock_irq() because they don't know caller at all.
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&ctx->completion_lock)->rlock);
<Interrupt>
lock(&(&ctx->completion_lock)->rlock);
*** DEADLOCK ***
dump_stack+0x19/0x1b
print_usage_bug+0x1f7/0x208
mark_lock+0x21d/0x2a0
mark_held_locks+0xb9/0x140
trace_hardirqs_on_caller+0x105/0x1d0
trace_hardirqs_on+0xd/0x10
_raw_spin_unlock_irq+0x2c/0x50
__set_page_dirty_nobuffers+0x8c/0xf0
migrate_page_copy+0x434/0x540
aio_migratepage+0xb1/0x140
move_to_new_page+0x7d/0x230
migrate_pages+0x5e5/0x700
migrate_misplaced_page+0xbc/0xf0
do_numa_page+0x102/0x190
handle_pte_fault+0x241/0x970
handle_mm_fault+0x265/0x370
__do_page_fault+0x172/0x5a0
do_page_fault+0x1a/0x70
page_fault+0x28/0x30
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f893ab41e4dae2fe8991faf5d86d029068d1ef3a upstream.
swapoff clear swap_info's SWP_USED flag prematurely and free its
resources after that. A concurrent swapon will reuse this swap_info
while its previous resources are not cleared completely.
These late freed resources are:
- p->percpu_cluster
- swap_cgroup_ctrl[type]
- block_device setting
- inode->i_flags &= ~S_SWAPFILE
This patch clears the SWP_USED flag after all its resources are freed,
so that swapon can reuse this swap_info by alloc_swap_info() safely.
[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 778c14affaf94a9e4953179d3e13a544ccce7707 upstream.
A 3% of system memory bonus is sometimes too excessive in comparison to
other processes.
With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
killer tries to avoid killing privileged tasks by subtracting 3% of
overall memory (system or cgroup) from their per-task consumption. But
as a result, all root tasks that consume less than 3% of overall memory
are considered equal, and so it only takes 33+ privileged tasks pushing
the system out of memory for the OOM killer to do something stupid and
kill dhclient or other root-owned processes. For example, on a 32G
machine it can't tell the difference between the 1M agetty and the 10G
fork bomb member.
The changelog describes this 3% boost as the equivalent to the global
overcommit limit being 3% higher for privileged tasks, but this is not
the same as discounting 3% of overall memory from _every privileged task
individually_ during OOM selection.
Replace the 3% of system memory bonus with a 3% of current memory usage
bonus.
By giving root tasks a bonus that is proportional to their actual size,
they remain comparable even when relatively small. In the example
above, the OOM killer will discount the 1M agetty's 256 badness points
down to 179, and the 10G fork bomb's 262144 points down to 183500 points
and make the right choice, instead of discounting both to 0 and killing
agetty because it's first in the task list.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: existing code changes 'points' directly rather
than using 'adj' variable]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 8afb1474db4701d1ab80cd8251137a3260e6913e upstream.
/sys/kernel/slab/:t-0000048 # cat cpu_slabs
231 N0=16 N1=215
/sys/kernel/slab/:t-0000048 # cat slabs
145 N0=36 N1=109
See, the number of slabs is smaller than that of cpu slabs.
The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700
("slub: per cpu cache for partial pages").
We should use page->pages instead of page->pobjects when calculating
the number of cpu partial slabs. This also fixes the mapping of slabs
and nodes.
As there's no variable storing the number of total/active objects in
cpu partial slabs, and we don't have user interfaces requiring those
statistics, I just add WARN_ON for those cases.
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream.
Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
caused by THP") can cause dereference of a dangling pointer if
split_huge_page runs during PageHuge() if there are updates to the
tail_page->private field.
Also it is repeating compound_head twice for hugetlbfs and it is running
compound_head+compound_trans_head for THP when a single one is needed in
both cases.
The new code within the PageSlab() check doesn't need to verify that the
THP page size is never bigger than the smallest hugetlbfs page size, to
avoid memory corruption.
A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_unlock label, that is
relevant for the compound_lock path too).
By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the below problem:
echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
BUG: Bad page state in process bash pfn:59a01
page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
page flags: 0x1c00000000008000(tail)
Modules linked in:
CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x55/0x76
bad_page+0xd5/0x130
free_pages_prepare+0x213/0x280
__free_pages+0x36/0x80
update_and_free_page+0xc1/0xd0
free_pool_huge_page+0xc2/0xe0
set_max_huge_pages.part.58+0x14c/0x220
nr_hugepages_store_common.isra.60+0xd0/0xf0
nr_hugepages_store+0x13/0x20
kobj_attr_store+0xf/0x20
sysfs_write_file+0x189/0x1e0
vfs_write+0xc5/0x1f0
SyS_write+0x55/0xb0
system_call_fastpath+0x16/0x1b
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Khalid Aziz: Backported to 3.4]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.
I am working with a tool that simulates oracle database I/O workload.
This tool (orion to be specific -
<http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag. It then
does aio into these pages from flash disks using various common block
sizes used by database. I am looking at performance with two of the most
common block sizes - 1M and 64K. aio performance with these two block
sizes plunged after Transparent HugePages was introduced in the kernel.
Here are performance numbers:
pre-THP 2.6.39 3.11-rc5
1M read 8384 MB/s 5629 MB/s 6501 MB/s
64K read 7867 MB/s 4576 MB/s 4251 MB/s
I have narrowed the performance impact down to the overheads introduced by
THP in __get_page_tail() and put_compound_page() routines. perf top shows
>40% of cycles being spent in these two routines. Every time direct I/O
to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
the pages and calls put_page() when I/O completes to put the reference
away. THP introduced significant amount of locking overhead to get_page()
and put_page() when dealing with compound pages because hugepages can be
split underneath get_page() and put_page(). It added this overhead
irrespective of whether it is dealing with hugetlbfs pages or transparent
hugepages. This resulted in 20%-45% drop in aio performance when using
hugetlbfs pages.
Since hugetlbfs pages can not be split, there is no reason to go through
all the locking overhead for these pages from what I can see. I added
code to __get_page_tail() and put_compound_page() to bypass all the
locking code when working with hugetlbfs pages. This improved performance
significantly. Performance numbers with this patch:
pre-THP 3.11-rc5 3.11-rc5 + Patch
1M read 8384 MB/s 6501 MB/s 8371 MB/s
64K read 7867 MB/s 4251 MB/s 6510 MB/s
Performance with 64K read is still lower than what it was before THP, but
still a 53% improvement. It does mean there is more work to be done but I
will take a 53% improvement for now.
Please take a look at the following patch and let me know if it looks
reasonable.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2afc745f3e3079ab16c826be4860da2529054dd2 upstream.
This patch fixes the problem that get_unmapped_area() can return illegal
address and result in failing mmap(2) etc.
In case that the address higher than PAGE_SIZE is set to
/proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
returned by get_unmapped_area(), even if you do not pass any virtual
address hint (i.e. the second argument).
This is because the current get_unmapped_area() code does not take into
account mmap_min_addr.
This leads to two actual problems as follows:
1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
although any illegal parameter is not passed.
2. The bottom-up search path after the top-down search might not work in
arch_get_unmapped_area_topdown().
Note: The first and third chunk of my patch, which changes "len" check,
are for more precise check using mmap_min_addr, and not for solving the
above problem.
[How to reproduce]
--- test.c -------------------------------------------------
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/errno.h>
int main(int argc, char *argv[])
{
void *ret = NULL, *last_map;
size_t pagesize = sysconf(_SC_PAGESIZE);
do {
last_map = ret;
ret = mmap(0, pagesize, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// printf("ret=%p\n", ret);
} while (ret != MAP_FAILED);
if (errno != ENOMEM) {
printf("ERR: unexpected errno: %d (last map=%p)\n",
errno, last_map);
}
return 0;
}
---------------------------------------------------------------
$ gcc -m32 -o test test.c
$ sudo sysctl -w vm.mmap_min_addr=65536
vm.mmap_min_addr = 65536
$ ./test (run as non-priviledge user)
ERR: unexpected errno: 1 (last map=0x10000)
Signed-off-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
Signed-off-by: Kiyoshi Owada <owada.kiyoshi@jp.panasonic.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
As we do not have vm_unmapped_area(), make arch_get_unmapped_area_topdown()
calculate the lower limit for the new area's end address and then compare
addresses with this instead of with len. In the process, fix an off-by-one
error which could result in returning 0 if mm->mmap_base == len.]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|