linux-toradex.git/mm, branch v2.6.27.50

do_generic_file_read: clear page errors when issuing a fresh read of the page

2010-07-05T18:08:44+00:00

commit 91803b499cca2fe558abad709ce83dc896b80950 upstream.

I/O errors can happen due to temporary failures, like multipath
errors or losing network contact with the iSCSI server. Because
of that, the VM will retry readpage on the page.

However, do_generic_file_read does not clear PG_error.  This
causes the system to be unable to actually use the data in the
page cache page, even if the subsequent readpage completes
successfully!

The function filemap_fault has had a ClearPageError before
readpage forever.  This patch simply adds the same to
do_generic_file_read.
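
As a rough userspace illustration of why the stale flag matters (a toy
model, not kernel code): the struct, flag values and toy_* helpers below
are hypothetical, and the rule that a completion handler refuses to mark
a page up to date while an error is recorded is a modelling assumption.

```c
#include <stdbool.h>

/* Toy stand-ins for page flags; not the kernel's real definitions. */
#define TOY_PG_ERROR    0x1u
#define TOY_PG_UPTODATE 0x2u

struct toy_page {
	unsigned int flags;
};

/*
 * Hypothetical read completion: a failed read records TOY_PG_ERROR, and
 * the page is only marked up to date when the I/O succeeded and no
 * error is recorded on the page.
 */
static void toy_readpage(struct toy_page *page, bool io_ok)
{
	if (!io_ok)
		page->flags |= TOY_PG_ERROR;
	if (io_ok && !(page->flags & TOY_PG_ERROR))
		page->flags |= TOY_PG_UPTODATE;
}

/*
 * Retry a read whose I/O now succeeds.  clear_error selects the
 * behaviour the patch adds: drop the old error before reissuing.
 * Returns true if the page ends up usable.
 */
static bool toy_retry_read(struct toy_page *page, bool clear_error)
{
	if (clear_error)
		page->flags &= ~TOY_PG_ERROR;	/* models ClearPageError() */
	toy_readpage(page, true);
	return (page->flags & TOY_PG_UPTODATE) != 0;
}
```

In this model a page that failed once stays unusable after a successful
retry unless the error flag is cleared first.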

Signed-off-by: Jeff Moyer 
Signed-off-by: Rik van Riel 
Acked-by: Larry Woodman 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

nfsd: fix vm overcommit crash

2010-05-26T21:27:09+00:00

commit 731572d39fcd3498702eda4600db4c43d51e0b26 upstream.

Junjiro R.  Okajima reported a problem where knfsd crashes if you are
using it to export shmemfs objects and run strict overcommit.  In this
situation the current->mm based modifier to the overcommit goes through a
NULL pointer.

We could simply check for NULL and skip the modifier but we've caught
other real bugs in the past from mm being NULL here - cases where we did
need a valid mm set up (eg the exec bug about a year ago).

To preserve the checks and get the logic we want, shuffle the checking
around and add a new helper to the vm_ security wrappers.

Also fix a current->mm reference in nommu that should use the passed mm.
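
The shape of that split can be sketched in userspace as follows.  This is
only an illustration under assumptions: every toy_* name, the return
codes and the /32 modifier are made up for the example and are not taken
from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the part of an mm the check looks at. */
struct toy_mm {
	long total_vm;	/* pages mapped by the task */
};

/*
 * Core check.  The mm is passed in rather than read from the current
 * task, and the per-process modifier is skipped when there is no mm.
 */
static int toy_enough_memory(const struct toy_mm *mm, long pages, long allowed)
{
	if (mm)
		allowed -= mm->total_vm / 32;	/* illustrative modifier */
	return pages <= allowed ? 0 : -1;
}

/* Wrapper for ordinary tasks: a NULL mm is still treated as a bug. */
static int toy_vm_enough_memory(const struct toy_mm *current_mm,
				long pages, long allowed)
{
	if (current_mm == NULL)
		return -2;	/* keep catching callers that lack an mm */
	return toy_enough_memory(current_mm, pages, allowed);
}

/* Separate helper for kernel threads, which legitimately have no mm. */
static int toy_vm_enough_memory_kern(long pages, long allowed)
{
	return toy_enough_memory(NULL, pages, allowed);
}
```

The point of having two entry points is that the NULL check stays strict
for normal callers while kernel-thread callers get an explicit,
intentional path.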

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix build]
Reported-by: Junjiro R. Okajima 
Acked-by: James Morris 
Signed-off-by: Alan Cox 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

vfs: Remove the range_cont writeback mode.

2010-05-26T21:27:06+00:00

commit 74baaaaec8b4f22e1ae279f5ecca4ff705b28912 upstream.

Ext4 was the only user of range_cont writeback mode and ext4 switched
to a different method. So remove the range_cont mode which is not used
in the kernel.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: "Theodore Ts'o" 
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Jayson R. King 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

tmpfs: cleanup mpol_parse_str()

2010-04-01T22:52:29+00:00

commit 926f2ae04f183098cf9a30521776fb2759c8afeb upstream.

mpol_parse_str() has produced many bugs related to its 'err' variable,
because the code is ugly and unfriendly to review.

This patch simplifies it.

Signed-off-by: KOSAKI Motohiro 
Cc: Ravikiran Thirumalai 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Acked-by: Lee Schermerhorn 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

tmpfs: handle MPOL_LOCAL mount option properly

2010-04-01T22:52:28+00:00

commit 12821f5fb942e795f8009ece14bde868893bd811 upstream.

commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer in
shmem_sb_info) added the mpol=local mount option, but the feature has been
broken since it was introduced, because the code always returns 1 (i.e.
mount failure).

This patch fixes it.

Signed-off-by: KOSAKI Motohiro 
Cc: Ravikiran Thirumalai 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Acked-by: Lee Schermerhorn 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

tmpfs: mpol=bind:0 don't cause mount error.

2010-04-01T22:52:28+00:00

commit d69b2e63e9172afb4d07c305601b79a55509ac4c upstream.

Currently, the following mount operation causes a mount error.

% mount -t tmpfs -ompol=bind:0 none /tmp

This is because commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer
in shmem_sb_info) corrupted the MPOL_BIND parse code.

This patch restores the needed code.

Signed-off-by: KOSAKI Motohiro 
Cc: Ravikiran Thirumalai 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Acked-by: Lee Schermerhorn 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

tmpfs: fix oops on mounts with mpol=default

2010-04-01T22:52:27+00:00

commit 413b43deab8377819aba1dbad2abf0c15d59b491 upstream.

Fix an 'oops' when a tmpfs mount point is mounted with the mpol=default
mempolicy.

Upon remounting a tmpfs mount point with 'mpol=default' option, the mount
code crashed with a null pointer dereference.  The initial problem report
was on 2.6.27, but the problem exists in mainline 2.6.34-rc as well.  On
examining the code, we see that mpol_new returns NULL if default mempolicy
was requested.  This 'NULL' mempolicy is accessed to store the node mask
resulting in oops.

The following patch fixes it.
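
A minimal userspace sketch of the failure mode and the guard, assuming a
constructor that returns NULL for the default policy as described above.
All toy_* names and types are hypothetical and do not reproduce the
kernel's mempolicy code.

```c
#include <stddef.h>
#include <stdlib.h>

enum toy_mode { TOY_MPOL_DEFAULT, TOY_MPOL_BIND };

struct toy_mempolicy {
	enum toy_mode mode;
	unsigned long nodes;	/* node mask, one bit per node */
};

/* Like the mpol_new() described above: NULL means "default policy". */
static struct toy_mempolicy *toy_mpol_new(enum toy_mode mode)
{
	struct toy_mempolicy *pol;

	if (mode == TOY_MPOL_DEFAULT)
		return NULL;
	pol = calloc(1, sizeof(*pol));
	if (pol)
		pol->mode = mode;
	return pol;
}

/*
 * Build a policy from already-parsed mount options.  Returns 0 on
 * success; *out is NULL when the default policy was requested.
 */
static int toy_parse_policy(enum toy_mode mode, unsigned long nodes,
			    struct toy_mempolicy **out)
{
	struct toy_mempolicy *pol = toy_mpol_new(mode);

	if (pol == NULL && mode != TOY_MPOL_DEFAULT)
		return -1;		/* allocation failure */
	if (pol)
		pol->nodes = nodes;	/* store the mask only if a policy exists */
	*out = pol;
	return 0;
}
```

Without the `if (pol)` guard, the store to `pol->nodes` would dereference
NULL whenever the default policy is requested.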

Signed-off-by: Ravikiran Thirumalai 
Signed-off-by: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: Mel Gorman 
Acked-by: Lee Schermerhorn 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Fix potential crash with sys_move_pages

2010-04-01T22:52:15+00:00

commit 6f5a55f1a6c5abee15a0e878e5c74d9f1569b8b0 upstream.

We incorrectly depended on the 'node_state/node_isset()' functions
testing the node range, rather than checking it explicitly.  That's not
reliable, even if it might often happen to work.  So do the proper
explicit test.
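
The difference between relying on the bit test and checking the range
first can be sketched in userspace like this.  TOY_MAX_NODES, the
nodemask layout and the return codes are assumptions made for the
example, not the kernel's definitions.

```c
#include <limits.h>
#include <stdbool.h>

#define TOY_MAX_NODES     64
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS \
	((TOY_MAX_NODES + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

struct toy_nodemask {
	unsigned long bits[TOY_WORDS];
};

/* Plain bit test: no range check, so the caller must pass a valid node. */
static bool toy_node_isset(int node, const struct toy_nodemask *mask)
{
	return (mask->bits[node / TOY_BITS_PER_LONG] >>
		(node % TOY_BITS_PER_LONG)) & 1UL;
}

/*
 * Validate a node number supplied by an untrusted caller.
 * Returns 0 if usable, -1 if out of range, -2 if in range but not set.
 */
static int toy_check_node(int node, const struct toy_nodemask *online)
{
	if (node < 0 || node >= TOY_MAX_NODES)
		return -1;	/* explicit range test comes first */
	if (!toy_node_isset(node, online))
		return -2;
	return 0;
}
```

Calling toy_node_isset() directly with a negative or oversized node would
index outside the bitmap, which is the kind of access the explicit test
rules out.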

Reported-by: Marcus Meissner 
Acked-and-tested-by: Brice Goglin 
Acked-by: Hugh Dickins 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mbind(): fix leak of never putback pages

2009-11-10T00:52:13+00:00

commit ab8a3e14e6f8e567560f664bbd29aefb306a274e upstream.

If mbind() receives an invalid address, do_mbind leaks a page.  The
following test program detects this leak.

This patch fixes it.

migrate_efault.c
=======================================
 #include <numaif.h>
 #include <numa.h>
 #include <sys/mman.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <stdlib.h>
 #include <string.h>

static unsigned long pagesize;

static void* make_hole_mapping(void)
{

	void* addr;

	addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
		    MAP_ANON|MAP_PRIVATE, 0, 0);
	if (addr == MAP_FAILED)
		return NULL;

	/* make page populate */
	memset(addr, 0, pagesize*3);

	/* make memory hole */
	munmap(addr+pagesize, pagesize);

	return addr;
}

int main(int argc, char** argv)
{
	void* addr;
	int ch;
	int node;
	struct bitmask *nmask = numa_allocate_nodemask();
	int err;
	int node_set = 0;

	while ((ch = getopt(argc, argv, "n:")) != -1){
		switch (ch){
		case 'n':
			node = strtol(optarg, NULL, 0);
			numa_bitmask_setbit(nmask, node);
			node_set = 1;
			break;
		default:
			;
		}
	}
	argc -= optind;
	argv += optind;

	if (!node_set)
		numa_bitmask_setbit(nmask, 0);

	pagesize = getpagesize();

	addr = make_hole_mapping();

	err = mbind(addr, pagesize*3, MPOL_BIND, nmask->maskp, nmask->size, MPOL_MF_MOVE_ALL);
	if (err)
		perror("mbind ");

	return 0;
}
=======================================

Signed-off-by: KOSAKI Motohiro 
Acked-by: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust()

2009-10-05T15:11:51+00:00

commit 252c5f94d944487e9f50ece7942b0fbf659c5c31 upstream.

We noticed very erratic behavior [throughput] with the AIM7 shared
workload running on recent distro [SLES11] and mainline kernels on an
8-socket, 32-core, 256GB x86_64 platform.  On the SLES11 kernel
[2.6.27.19+] with Barcelona processors, as we increased the load [10s of
thousands of tasks], the throughput would vary between two "plateaus"--one
at ~65K jobs per minute and one at ~130K jpm.  The simple patch below
causes the results to smooth out at the ~130k plateau.

But wait, there's more:

We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
This could be the result of the larger number of cpus on the larger
platform--a scalability issue--or it could be the result of the larger
number of interconnect "hops" between some nodes in this platform and how
the tasks for a given load end up distributed over the nodes' cpus and
memories--a stochastic NUMA effect.

The variability in the results is less pronounced [on the same platform]
with Shanghai processors and with mainline kernels.  With 31-rc6 on
Shanghai processors and 288 file systems on 288 fibre attached storage
volumes, the curves [jpm vs load] are both quite flat with the patched
kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
jpm] than the unpatched kernel.

Profiling indicated that the "slow" runs were incurring high[er]
contention on an anon_vma lock in vma_adjust(), apparently called from the
sbrk() system call.

The patch:

A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
anon_vma lock when we're only adjusting the end of a vma, as is the case
for brk().  The comment questions whether it's worth while to optimize for
this case.  Apparently, on the newer, larger x86_64 platforms, with
interesting NUMA topologies, it is worth while--especially considering
that the patch [if correct!] is quite simple.

We can detect this condition--no overlap with next vma--by noting a NULL
"importer".  The anon_vma pointer will also be NULL in this case, so
simply avoid loading vma->anon_vma to avoid the lock.

However, we DO need to take the anon_vma lock when we're inserting a vma
['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
need to check for 'insert', as well.  And Hugh points out that we should
also take it when adjusting vm_start (so that rmap.c can rely upon
vma_address() while it holds the anon_vma lock).
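
The three conditions above can be written as a single predicate.  The
following is a userspace sketch only: the toy_* structs and the function
name are hypothetical, and it models just the decision, not the locking
or the rest of vma_adjust().

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_anon_vma {
	int unused;	/* placeholder; the real structure holds a lock */
};

struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct toy_anon_vma *anon_vma;
};

/*
 * Decide whether adjusting 'vma' needs its anon_vma lock.  The lock is
 * skipped only when nothing is inserted, there is no importer (no
 * overlap with the next vma) and vm_start is unchanged, i.e. only the
 * end of the vma moves, as with brk().
 */
static bool toy_needs_anon_vma_lock(const struct toy_vma *vma,
				    unsigned long new_start,
				    const struct toy_vma *insert,
				    const struct toy_vma *importer)
{
	if (vma->anon_vma == NULL)
		return false;	/* no anon_vma, so nothing to lock */
	return insert != NULL || importer != NULL ||
	       new_start != vma->vm_start;
}
```

For a brk()-style call that only moves vm_end, the predicate is false and
the contended lock is never touched.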

akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
might be -stable material even though this isn't a regression: "this
issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
severe on large machine (4*8*2 cpu)"

[hugh.dickins@tiscali.co.uk: test vma start too]
Signed-off-by: Lee Schermerhorn 
Signed-off-by: Hugh Dickins 
Cc: Nick Piggin 
Cc: Eric Whitney 
Tested-by: "Zhang, Yanmin" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman