Age | Commit message (Collapse) | Author |
|
Linux 3.1.10
Change-Id: I465d184c492e8041dd0cd90f2cb70fde17ba7118
Signed-off-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page(). This function replaces a page in the
radix-tree with a new page. WHen doing this, memory cgroup needs to fix
up the accounting information. memcg need to check PCG_USED bit etc.
In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache(). So, memcg's LRU accounting information should be
fixed, too.
This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.
Background:
replace_page_cache_page() is called by FUSE code in its splice() handling.
Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
because rmdir() checks the whole LRU is empty and there is no account leak.
If a page is on the other LRU than it should be, rmdir() will fail.
This bug was added in March 2011, but no bug report yet. I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.
The result of this bug is that admin cannot destroy a memcg because of
account leak. So, no panic, no deadlock. And, even if an active cgroup
exist, umount can succseed. So no problem at shutdown.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 73736e0387ba0e6d2b703407b4d26168d31516a7 upstream.
Zhihua Che reported a possible memleak in slub allocator on
CONFIG_PREEMPT=y builds.
It is possible current thread migrates right before disabling irqs in
__slab_alloc(). We must check again c->freelist, and perform a normal
allocation instead of scratching c->freelist.
Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39
V2: Its also possible an IRQ freed one (or several) object(s) and
populated c->freelist, so its not a CONFIG_PREEMPT only problem.
Reported-by: Zhihua Che <zhihua.che@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e26a51148f3ebd859bca8bf2e0f212839b447f62 upstream.
commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is the
slightly incorrect fix.
Why? Think following case.
1. map 4 pages of a file at offset 0
[0123]
2. map 2 pages just after the first mapping of the same file but with
page offset 2
[0123][23]
3. mbind() 2 pages from the first mapping at offset 2.
mbind_range() should treat new vma is,
[0123][23]
|23|
mbind vma
but it does
[0123][23]
|01|
mbind vma
Oops. then, it makes wrong vma merge and splitting ([01][0123] or similar).
This patch fixes it.
[testcase]
test result - before the patch
case4: 126: test failed. expect '2,4', actual '2,2,2'
case5: passed
case6: passed
case7: passed
case8: passed
case_n: 246: test failed. expect '4,2', actual '1,4'
------------[ cut here ]------------
kernel BUG at mm/filemap.c:135!
invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC
(snip long bug on messages)
test result - after the patch
case4: passed
case5: passed
case6: passed
case7: passed
case8: passed
case_n: passed
source: mbind_vma_test.c
============================================================
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
static unsigned long pagesize;
void* mmap_addr;
struct bitmask *nmask;
char buf[1024];
FILE *file;
char retbuf[10240] = "";
int mapped_fd;
char *rubysrc = "ruby -e '\
pid = %d; \
vstart = 0x%llx; \
vend = 0x%llx; \
s = `pmap -q #{pid}`; \
rary = []; \
s.each_line {|line|; \
ary=line.split(\" \"); \
addr = ary[0].to_i(16); \
if(vstart <= addr && addr < vend) then \
rary.push(ary[1].to_i()/4); \
end; \
}; \
print rary.join(\",\"); \
'";
void init(void)
{
void* addr;
char buf[128];
nmask = numa_allocate_nodemask();
numa_bitmask_setbit(nmask, 0);
pagesize = getpagesize();
sprintf(buf, "%s", "mbind_vma_XXXXXX");
mapped_fd = mkstemp(buf);
if (mapped_fd == -1)
perror("mkstemp "), exit(1);
unlink(buf);
if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
perror("lseek "), exit(1);
if (write(mapped_fd, "\0", 1) < 0)
perror("write "), exit(1);
addr = mmap(NULL, pagesize*8, PROT_NONE,
MAP_SHARED, mapped_fd, 0);
if (addr == MAP_FAILED)
perror("mmap "), exit(1);
if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
perror("mprotect "), exit(1);
mmap_addr = addr + pagesize;
/* make page populate */
memset(mmap_addr, 0, pagesize*6);
}
void fin(void)
{
void* addr = mmap_addr - pagesize;
munmap(addr, pagesize*8);
memset(buf, 0, sizeof(buf));
memset(retbuf, 0, sizeof(retbuf));
}
void mem_bind(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_BIND, nmask->maskp, nmask->size, 0);
if (err)
perror("mbind "), exit(err);
}
void mem_interleave(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
if (err)
perror("mbind "), exit(err);
}
void mem_unbind(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_DEFAULT, NULL, 0, 0);
if (err)
perror("mbind "), exit(err);
}
void Assert(char *expected, char *value, char *name, int line)
{
if (strcmp(expected, value) == 0) {
fprintf(stderr, "%s: passed\n", name);
return;
}
else {
fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
name, line,
expected, value);
// exit(1);
}
}
/*
AAAA
PPPPPPNNNNNN
might become
PPNNNNNNNNNN
case 4 below
*/
void case4(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 4);
mem_unbind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("2,4", retbuf, "case4", __LINE__);
fin();
}
/*
AAAA
PPPPPPNNNNNN
might become
PPPPPPPPPPNN
case 5 below
*/
void case5(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case5", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPPPPPPPPP 6
*/
void case6(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_bind(4, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("6", retbuf, "case6", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPPPPPXXXX 7
*/
void case7(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_interleave(4, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case7", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPNNNNNNNN 8
*/
void case8(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_interleave(4, 2);
mem_interleave(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("2,4", retbuf, "case8", __LINE__);
fin();
}
void case_n(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
/* make redundunt mappings [0][1234][34][7] */
mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);
/* Expect to do nothing. */
mem_unbind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case_n", __LINE__);
fin();
}
int main(int argc, char** argv)
{
case4();
case5();
case6();
case7();
case8();
case_n();
return 0;
}
=============================================================
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Caspar Zhang <caspar@casparzhang.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: I43dc85715b04c36ba118d6e63c96786c634e2eea
Reviewed-on: http://git-master/r/74202
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit b0365c8d0cb6e79eb5f21418ae61ab511f31b575 upstream.
If a huge page is enqueued under the protection of hugetlb_lock, then the
operation is atomic and safe.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: I82b34c64333393dc83d2c892114c9fb5cf432ea4
Reviewed-on: http://git-master/r/74201
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit a41c58a6665cc995e237303b05db42100b71b65e upstream.
If the request is to create non-root group and we fail to meet it, we
should leave the root unchanged.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: Ib38fb531508f5c250e9e52f6dc3432db32c315ad
Reviewed-on: http://git-master/r/74192
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.
lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: I200c86387cc650dbfb33c8a22ae943ed6996648a
Reviewed-on: http://git-master/r/74187
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.
An integer overflow will happen on 64bit archs if task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
commit
f755a04 oom: use pte pages in OOM score
where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long(rss, swapents, nr_pte
are unsigned long), where the result value assigned to points(int) is in
range(1..1000). So there could be an int overflow while computing
176 points *= 1000;
and points may have negative value. Meaning the oom score for a mem hog task
will be one.
196 if (points <= 0)
197 return 1;
For example:
[ 3366] 0 3366 35390480 24303939 5 0 0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but it's oom score is one.
In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.
The points variable should be of type long instead of int to prevent the
int overflow.
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: I56c6a8a4aadca809e04276eabe5552935c51387f
Reviewed-on: http://git-master/r/74176
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.
per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
case to the page boundary, which is bogus for any non-page-aligned
address.
This affects the only in-tree user of this function - sysfs handler
for per-cpu 'crash_notes' physical address. The trouble is that the
crash_notes per-cpu variable is not page-aligned:
crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
CPU 0: 3711f000
CPU 1: 37129000
CPU 2: 37133000
CPU 3: 3713d000
So, the per-cpu addresses are:
crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4
However, /sys/devices/system/cpu/cpu*/crash_notes says:
/sys/devices/system/cpu/cpu0/crash_notes: 36b57000
/sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
/sys/devices/system/cpu/cpu2/crash_notes: 36b43000
/sys/devices/system/cpu/cpu3/crash_notes: 36b39000
As you can see, all values are rounded down to a page
boundary. Consequently, this is where kexec sets up the NOTE segments,
and thus where the secondary kernel is looking for them. However, when
the first kernel crashes, it saves the notes to the unaligned
addresses, where they are not found.
Fix it by adding offset_in_page() to the translated page address.
-tj: Combined Eugene's and Petr's commit messages.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Change-Id: I2c141bbe1ddfb7f91749af4411f884125ea6e14e
Reviewed-on: http://git-master/r/74173
Reviewed-by: Varun Wadekar <vwadekar@nvidia.com>
Tested-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit e26a51148f3ebd859bca8bf2e0f212839b447f62 upstream.
commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is the
slightly incorrect fix.
Why? Think following case.
1. map 4 pages of a file at offset 0
[0123]
2. map 2 pages just after the first mapping of the same file but with
page offset 2
[0123][23]
3. mbind() 2 pages from the first mapping at offset 2.
mbind_range() should treat new vma is,
[0123][23]
|23|
mbind vma
but it does
[0123][23]
|01|
mbind vma
Oops. then, it makes wrong vma merge and splitting ([01][0123] or similar).
This patch fixes it.
[testcase]
test result - before the patch
case4: 126: test failed. expect '2,4', actual '2,2,2'
case5: passed
case6: passed
case7: passed
case8: passed
case_n: 246: test failed. expect '4,2', actual '1,4'
------------[ cut here ]------------
kernel BUG at mm/filemap.c:135!
invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC
(snip long bug on messages)
test result - after the patch
case4: passed
case5: passed
case6: passed
case7: passed
case8: passed
case_n: passed
source: mbind_vma_test.c
============================================================
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
static unsigned long pagesize;
void* mmap_addr;
struct bitmask *nmask;
char buf[1024];
FILE *file;
char retbuf[10240] = "";
int mapped_fd;
char *rubysrc = "ruby -e '\
pid = %d; \
vstart = 0x%llx; \
vend = 0x%llx; \
s = `pmap -q #{pid}`; \
rary = []; \
s.each_line {|line|; \
ary=line.split(\" \"); \
addr = ary[0].to_i(16); \
if(vstart <= addr && addr < vend) then \
rary.push(ary[1].to_i()/4); \
end; \
}; \
print rary.join(\",\"); \
'";
void init(void)
{
void* addr;
char buf[128];
nmask = numa_allocate_nodemask();
numa_bitmask_setbit(nmask, 0);
pagesize = getpagesize();
sprintf(buf, "%s", "mbind_vma_XXXXXX");
mapped_fd = mkstemp(buf);
if (mapped_fd == -1)
perror("mkstemp "), exit(1);
unlink(buf);
if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
perror("lseek "), exit(1);
if (write(mapped_fd, "\0", 1) < 0)
perror("write "), exit(1);
addr = mmap(NULL, pagesize*8, PROT_NONE,
MAP_SHARED, mapped_fd, 0);
if (addr == MAP_FAILED)
perror("mmap "), exit(1);
if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
perror("mprotect "), exit(1);
mmap_addr = addr + pagesize;
/* make page populate */
memset(mmap_addr, 0, pagesize*6);
}
void fin(void)
{
void* addr = mmap_addr - pagesize;
munmap(addr, pagesize*8);
memset(buf, 0, sizeof(buf));
memset(retbuf, 0, sizeof(retbuf));
}
void mem_bind(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_BIND, nmask->maskp, nmask->size, 0);
if (err)
perror("mbind "), exit(err);
}
void mem_interleave(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
if (err)
perror("mbind "), exit(err);
}
void mem_unbind(int index, int len)
{
int err;
err = mbind(mmap_addr+pagesize*index, pagesize*len,
MPOL_DEFAULT, NULL, 0, 0);
if (err)
perror("mbind "), exit(err);
}
void Assert(char *expected, char *value, char *name, int line)
{
if (strcmp(expected, value) == 0) {
fprintf(stderr, "%s: passed\n", name);
return;
}
else {
fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
name, line,
expected, value);
// exit(1);
}
}
/*
AAAA
PPPPPPNNNNNN
might become
PPNNNNNNNNNN
case 4 below
*/
void case4(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 4);
mem_unbind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("2,4", retbuf, "case4", __LINE__);
fin();
}
/*
AAAA
PPPPPPNNNNNN
might become
PPPPPPPPPPNN
case 5 below
*/
void case5(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case5", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPPPPPPPPP 6
*/
void case6(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_bind(4, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("6", retbuf, "case6", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPPPPPXXXX 7
*/
void case7(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_interleave(4, 2);
mem_bind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case7", __LINE__);
fin();
}
/*
AAAA
PPPPNNNNXXXX
might become
PPPPNNNNNNNN 8
*/
void case8(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
mem_bind(0, 2);
mem_interleave(4, 2);
mem_interleave(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("2,4", retbuf, "case8", __LINE__);
fin();
}
void case_n(void)
{
init();
sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
/* make redundunt mappings [0][1234][34][7] */
mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);
/* Expect to do nothing. */
mem_unbind(2, 2);
file = popen(buf, "r");
fread(retbuf, sizeof(retbuf), 1, file);
Assert("4,2", retbuf, "case_n", __LINE__);
fin();
}
int main(int argc, char** argv)
{
case4();
case5();
case6();
case7();
case8();
case_n();
return 0;
}
=============================================================
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Caspar Zhang <caspar@casparzhang.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit b0365c8d0cb6e79eb5f21418ae61ab511f31b575 upstream.
If a huge page is enqueued under the protection of hugetlb_lock, then the
operation is atomic and safe.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit a41c58a6665cc995e237303b05db42100b71b65e upstream.
If the request is to create non-root group and we fail to meet it, we
should leave the root unchanged.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.
lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.
An integer overflow will happen on 64bit archs if task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
commit
f755a04 oom: use pte pages in OOM score
where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long(rss, swapents, nr_pte
are unsigned long), where the result value assigned to points(int) is in
range(1..1000). So there could be an int overflow while computing
176 points *= 1000;
and points may have negative value. Meaning the oom score for a mem hog task
will be one.
196 if (points <= 0)
197 return 1;
For example:
[ 3366] 0 3366 35390480 24303939 5 0 0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but it's oom score is one.
In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.
The points variable should be of type long instead of int to prevent the
int overflow.
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.
per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
case to the page boundary, which is bogus for any non-page-aligned
address.
This affects the only in-tree user of this function - sysfs handler
for per-cpu 'crash_notes' physical address. The trouble is that the
crash_notes per-cpu variable is not page-aligned:
crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
CPU 0: 3711f000
CPU 1: 37129000
CPU 2: 37133000
CPU 3: 3713d000
So, the per-cpu addresses are:
crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4
However, /sys/devices/system/cpu/cpu*/crash_notes says:
/sys/devices/system/cpu/cpu0/crash_notes: 36b57000
/sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
/sys/devices/system/cpu/cpu2/crash_notes: 36b43000
/sys/devices/system/cpu/cpu3/crash_notes: 36b39000
As you can see, all values are rounded down to a page
boundary. Consequently, this is where kexec sets up the NOTE segments,
and thus where the secondary kernel is looking for them. However, when
the first kernel crashes, it saves the notes to the unaligned
addresses, where they are not found.
Fix it by adding offset_in_page() to the translated page address.
-tj: Combined Eugene's and Petr's commit messages.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Change-Id: I99507d7cfdcee064f808856dc2ce99d806fd864f
|
|
commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.
Percpu allocator recorded the cpus which map to the first and last
units in pcpu_first/last_unit_cpu respectively and used them to
determine the address range of a chunk - e.g. it assumed that the
first unit has the lowest address in a chunk while the last unit has
the highest address.
This simply isn't true. Groups in a chunk can have arbitrary positive
or negative offsets from the previous one and there is no guarantee
that the first unit occupies the lowest offset while the last one the
highest.
Fix it by actually comparing unit offsets to determine cpus occupying
the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu
to pcpu_low/high_unit_cpu to avoid confusion.
The chunk address range is used to flush cache on vmalloc area
map/unmap and decide whether a given address is in the first chunk by
per_cpu_ptr_to_phys() and the bug was discovered by invalid
per_cpu_ptr_to_phys() translation for crash_note.
Kudos to Dave Young for tracking down the problem.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
Reported-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
LKML-Reference: <4EC21F67.10905@redhat.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().
This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
pageblocks
commit d021563888312018ca65681096f62e36c20e63cc upstream.
setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] SMP
Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
EIP is at setup_zone_migrate_reserve+0xcd/0x180
EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
Call Trace:
[<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
[<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
[<c08a771c>] init_per_zone_wmark_min+0x27/0x86
[<c020111b>] do_one_initcall+0x2b/0x160
[<c086639d>] kernel_init+0xbe/0x157
[<c05cae26>] kernel_thread_helper+0x6/0xd
Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
CR2: 00000000000012b4
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1 upstream.
Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.
kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
gup_pud_range+0xb8/0x19d
get_user_pages_fast+0xcb/0x192
? trace_hardirqs_off+0xd/0xf
hva_to_pfn+0x119/0x2f2
gfn_to_pfn_memslot+0x2c/0x2e
kvm_iommu_map_pages+0xfd/0x1c1
kvm_iommu_map_memslots+0x7c/0xbd
kvm_iommu_map_guest+0xaa/0xbf
kvm_vm_ioctl_assigned_device+0x2ef/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP gup_huge_pud+0xf2/0x159
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Conflicts:
arch/arm/Kconfig
Change-Id: If8aaaf3efcbbf6c9017b38efb6d76ef933f147fa
Signed-off-by: Varun Wadekar <vwadekar@nvidia.com>
|
|
commit 52cef189165d74a5d6030184a8e05595194c69ca upstream.
Commit 30765b92 ("slab, lockdep: Annotate the locks before using
them") moves the init_lock_keys() call from after g_cpucache_up =
FULL, to before it. And overlooks the fact that init_node_lock_keys()
tests for it and ignores everything !FULL.
Introduce a LATE stage and change the lockdep test to be <LATE.
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit ea4039a34c4c206d015d34a49d0b00868e37db1d upstream.
If we fail to prepare an anon_vma, the {new, old}_page should be released,
or they will leak.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Under the following conditions, __alloc_pages_slowpath can loop
forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
The gfp conditions are normally invalid, because !__GFP_FS
disables most of the reclaim methods that __GFP_WAIT would
wait for. However, these conditions happen very often during
suspend and resume, when pm_restrict_gfp_mask() effectively
converts all GFP_KERNEL allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER. __alloc_pages_slowpath will
loop forever between the rebalance label and should_alloc_retry,
unless another thread happens to release enough pages to satisfy
the allocation.
Add a check to detect when PM has disabled __GFP_FS, and do not
retry if reclaim is not making any progress.
[taken from patch on lkml by Mel Gorman, commit message by ccross]
Change-Id: I864a24e9d9fd98bd0e3d6e9c1e85b6c1b766850e
Signed-off-by: Colin Cross <ccross@android.com>
|
|
The arguments to shrink functions have changed, update
ashmem_shrink to match.
Change-Id: Id279d22d761a2a7c4965c957960eef804d06cc07
Signed-off-by: Colin Cross <ccross@android.com>
|
|
Signed-off-by: Bjorn Bringert <bringert@android.com>
Change-Id: I509d18b21832e229737ea7ebaa231fb107eb61d7
|
|
Change-Id: Ie527d18f3352ede06d565826c8d35ded1638203a
Signed-off-by: Colin Cross <ccross@google.com>
|
|
Change-Id: I1412cc9560de8c4feb1162fc30922f0e3362a476
Signed-off-by: Arve Hjønnevåg <arve@android.com>
|
|
Bug: 2595601
Change-Id: I47c0016f594f9354fb8658ccb26e3d395bcb137b
Signed-off-by: Bjorn Bringert <bringert@android.com>
|
|
Forward port of ashmem to 2.6.27.
Signed-off-by: Robert Love <rlove@google.com>
ashmem: Don't install fault handler for private mmaps.
Ashmem is used to create named private heaps. If this heap is backed
by a tmpfs file it will allocate two pages for every page touched.
In 2.6.27, the extra page would later be freed, but 2.6.29 does not
scan anonymous pages when running without swap so the memory is not
freed while the file is referenced. This change changes the behavior
of private ashmem mmaps to match /dev/zero instead tmpfs.
Signed-off-by: Arve Hjønnevåg <arve@android.com>
ashmem: Add common prefix to name reported in /proc/pid/maps
Signed-off-by: Arve Hjønnevåg <arve@android.com>
ashmem: don't require a page aligned size
This makes ashmem more similar to shmem and mmap, by
not requiring the specified size to be page aligned,
instead rounding it internally as needed.
Signed-off-by: Marco Nelissen <marcone@android.com>
|
|
By default the kernel tries to keep half as much memory free at each
order as it does for one order below. This can be too agressive when
running without swap.
Change-Id: I5efc1a0b50f41ff3ac71e92d2efd175dedd54ead
Signed-off-by: Arve Hjønnevåg <arve@android.com>
|
|
commit 7a401a972df8e184b3d1a3fc958c0a4ddee8d312 upstream.
bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
from all super_blocks and then del_timer_sync() the writeback timer.
However, this can race with __mark_inode_dirty(), leading to
bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
we're unregistering, after we've called del_timer_sync().
This can end up with the bdi being freed with an active timer inside it,
as in the case of the following dump after the removal of an SD card.
Fix this by redoing the del_timer_sync() in bdi_destory().
------------[ cut here ]------------
WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
Modules linked in:
Backtrace:
[<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
[<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
[<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
r3:00000009
[<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
r3:c02b1b29 r2:c02bc662
[<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
r6:c7964000 r5:00000001 r4:c7964000
[<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
[<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
[<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
r5:c79641f0 r4:c796420c
[<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
[<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
r5:c79643b4 r4:c79641f0
[<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
r4:c7964000
[<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
r5:00000000 r4:c794c400
[<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
r5:c794c400 r4:c0322824
[<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
r5:c78d5e00 r4:c74083c0
[<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
r3:00000000
[<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
[<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
[<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
r5:c794dc00 r4:c794dc00
[<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
r5:c794dc00 r4:c7942300
[<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
r6:c7942300 r5:00000000 r4:00000000 r3:00000000
[<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
---[ end trace e5c83c92ada51c76 ]---
Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).
The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit f5252e009d5b87071a919221e4f6624184005368 upstream.
The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.
The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.
Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.
Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.
Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'for-linus' of git://git.kernel.dk/linux-block:
floppy: use del_timer_sync() in init cleanup
blk-cgroup: be able to remove the record of unplugged device
block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
mm: Add comment explaining task state setting in bdi_forker_thread()
mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
block: simplify force plug flush code a little bit
block: change force plug flush call order
block: Fix queue_flag update when rq_affinity goes from 2 to 1
block: separate priority boosting from REQ_META
block: remove READ_META and WRITE_META
xen-blkback: fixed indentation and comments
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
|
|
* 'slab/urgent' of git://github.com/penberg/linux:
slub: add slab with one free object to partial list tail
|
|
The found entries by find_get_pages() could be all swap entries. In
this case we skip the entries, but make sure the skipped entries are
accounted, so we don't keep looping.
Using nr_found > nr_skip to simplify code as suggested by Eric.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Xen backend drivers (e.g., blkback and netback) would sometimes fail to
map grant pages into the vmalloc address space allocated with
alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
not find the page (in the L2 table) containing the PTEs it needed to
update.
(XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
netback and blkback were making the hypercall from a kernel thread where
task->active_mm != &init_mm and alloc_vm_area() was only updating the page
tables for init_mm. The usual method of deferring the update to the page
tables of other processes (i.e., after taking a fault) doesn't work as a
fault cannot occur during the hypercall.
This would work on some systems depending on what else was using vmalloc.
Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
from alloc_vm_area()") and add a comment to explain why it's needed.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Keir Fraser <keir.xen@gmail.com>
Cc: <stable@kernel.org> [3.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
memory.vmscan_stat").
The implementation of per-memcg reclaim statistics violates how memcg
hierarchies usually behave: hierarchically.
The reclaim statistics are accounted to child memcgs and the parent
hitting the limit, but not to hierarchy levels in between. Usually,
hierarchical statistics are perfectly recursive, with each level
representing the sum of itself and all its children.
Since this exports statistics to userspace, this may lead to confusion
and problems with changing things after the release, so revert it now,
we can try again later.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Without swap, anonymous pages are not scanned. As such, they should not
count when considering force-scanning a small target if there is no swap.
Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.
This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
yet it is referenced for per-node vmstat with CONFIG_NUMA:
drivers/built-in.o: In function `node_read_vmstat':
node.c:(.text+0x1106df): undefined reference to `vmstat_text'
Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
vmstats").
Define the array for CONFIG_NUMA as well.
[akpm@linux-foundation.org: remove unneeded ifdefs]
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When compiling mm/mempolicy.c with struct user copy checks the following
warning is shown:
In file included from arch/x86/include/asm/uaccess.h:572,
from include/linux/uaccess.h:5,
from include/linux/highmem.h:7,
from include/linux/pagemap.h:10,
from include/linux/mempolicy.h:70,
from mm/mempolicy.c:68:
In function `copy_from_user',
inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
LD mm/built-in.o
Fix this by passing correct buffer size value.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really
fix the mbind vma merge problem due to wrong pgoff value passing to
vma_merge(), which made vma_merge() always return NULL.
Before the patch applied, we are getting a result like:
addr = 0x7fa58f00c000
[snip]
7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0
here 7fa58f00c000->7fa58f00f000 we get 3 VMAs which are expected to be
merged described as described in commit 9d8cebd.
Re-testing the patched kernel with the reproducer provided in commit
9d8cebd, we get the correct result:
addr = 0x7ffa5aaa2000
[snip]
7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack]
Signed-off-by: Caspar Zhang <caspar@casparzhang.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
bdi_forker_thread() clears BDI_pending bit at the end of the main loop.
However clearing of this bit must not be done in some cases which is
handled by calling 'continue' from switch statement. That's kind of
unusual construct and without a good reason so change the function into
more intuitive code flow.
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
The slab has just one free object, adding it to partial list head doesn't make
sense. And it can cause lock contentation. For example,
1. CPU takes the slab from partial list
2. fetch an object
3. switch to another slab
4. free an object, then the slab is added to partial list again
In this way n->list_lock will be heavily contended.
In fact, Alex had a hackbench regression. 3.1-rc1 performance drops about 70%
against 3.0. This patch fixes it.
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
|
|
Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
counter") tried to oom lock the hierarchy and roll back upon
encountering an already locked memcg.
The code is confused when it comes to detecting a locked memcg, though,
so it would fail and rollback after locking one memcg and encountering
an unlocked second one.
The result is that oom-locking hierarchies fails unconditionally and
that every oom killer invocation simply goes to sleep on the oom
waitqueue forever. The tasks practically hang forever without anyone
intervening, possibly holding locks that trip up unrelated tasks, too.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task. It's possible ZONE_CONGESTED isn't cleared in some cases:
1. the zone is already balanced just entering balance_pgdat() for
order-0 because concurrent tasks free memory. In this case, later
check will skip the zone as it's balanced so the flag isn't cleared.
2. high order balance fallbacks to order-0. quote from Mel: At the
end of balance_pgdat(), kswapd uses the following logic;
If reclaiming at high order {
for each zone {
if all_unreclaimable
skip
if watermark is not met
order = 0
loop again
/* watermark is met */
clear congested
}
}
i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not,
it restarts balancing at order-0. However, if the higher zones are
balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
that only happens after a zone is shrunk. This can mean that
wait_iff_congested() stalls unnecessarily.
This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I get the below warning:
BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
caller is native_sched_clock+0x37/0x6e
Pid: 746, comm: bash Tainted: G W 3.0.0+ #254
Call Trace:
[<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc
[<ffffffff8104158d>] native_sched_clock+0x37/0x6e
[<ffffffff81116219>] try_to_free_mem_cgroup_pages+0x7d/0x270
[<ffffffff8114f1f8>] mem_cgroup_force_empty+0x24b/0x27a
[<ffffffff8114ff21>] ? sys_close+0x38/0x138
[<ffffffff8114ff21>] ? sys_close+0x38/0x138
[<ffffffff8114f257>] mem_cgroup_force_empty_write+0x17/0x19
[<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba
[<ffffffff811522d2>] vfs_write+0xb3/0x138
[<ffffffff8115241a>] sys_write+0x4a/0x71
[<ffffffff8114ffd9>] ? sys_close+0xf0/0x138
[<ffffffff8176deab>] system_call_fastpath+0x16/0x1b
sched_clock() can't be used with preempt enabled. And we don't need
fast approach to get clock here, so let's use ktime API.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|