<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm, branch v2.6.25.17</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm: make setup_zone_migrate_reserve() aware of overlapping nodes</title>
<updated>2008-09-08T10:20:16+00:00</updated>
<author>
<name>Adam Litke</name>
<email>agl@us.ibm.com</email>
</author>
<published>2008-09-03T02:35:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=30c90fbfe9498f7410131821510a95437016e77b'/>
<id>30c90fbfe9498f7410131821510a95437016e77b</id>
<content type='text'>
commit 344c790e3821dac37eb742ddd0b611a300f78b9a upstream

I have gotten to the root cause of the hugetlb badness I reported back on
August 15th.  My system has the following memory topology (note the
overlapping node):

            Node 0 Memory: 0x8000000-0x44000000
            Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000.  When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone.  Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke &lt;agl@us.ibm.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Cc: Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
Cc: Andy Whitcroft &lt;apw@shadowen.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 344c790e3821dac37eb742ddd0b611a300f78b9a upstream

I have gotten to the root cause of the hugetlb badness I reported back on
August 15th.  My system has the following memory topology (note the
overlapping node):

            Node 0 Memory: 0x8000000-0x44000000
            Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000.  When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone.  Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke &lt;agl@us.ibm.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Cc: Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
Cc: Andy Whitcroft &lt;apw@shadowen.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mlock() fix return values</title>
<updated>2008-08-20T18:15:25+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2008-08-05T00:20:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a81a91f03af0cb89a8a5144d2bb2dc670000aef8'/>
<id>a81a91f03af0cb89a8a5144d2bb2dc670000aef8</id>
<content type='text'>
commit a477097d9c37c1cf289c7f0257dffcfa42d50197 upstream

Halesh says:

Please find the below testcase provide to test mlock.

Test Case :
===========================

#include &lt;sys/resource.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&amp;rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14 that is EFAULT,
but it has nowhere been specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result i.e
errno 12 (ENOMEM).

I think in source code mlock(2), setting errno ENOMEM has been missed in
do_mlock() , on mlock_fixup() failure.

SUSv3 requires the following behavior frmo mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't so nice and slighly strange.  but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Tested-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a477097d9c37c1cf289c7f0257dffcfa42d50197 upstream

Halesh says:

Please find the below testcase provide to test mlock.

Test Case :
===========================

#include &lt;sys/resource.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&amp;rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14 that is EFAULT,
but it has nowhere been specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result i.e
errno 12 (ENOMEM).

I think in source code mlock(2), setting errno ENOMEM has been missed in
do_mlock() , on mlock_fixup() failure.

SUSv3 requires the following behavior frmo mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't so nice and slighly strange.  but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Tested-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: fix race between free buffer and commit transaction</title>
<updated>2008-08-06T17:10:58+00:00</updated>
<author>
<name>Mingming Cao</name>
<email>cmm@us.ibm.com</email>
</author>
<published>2008-07-25T08:46:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=53f501a19fc08631458f228b5ad2bdc46d65846b'/>
<id>53f501a19fc08631458f228b5ad2bdc46d65846b</id>
<content type='text'>
commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 upstream

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk.  If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller.  We have seen the directo IO failed
due to this race.  Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 upstream

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk.  If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller.  We have seen the directo IO failed
due to this race.  Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Fix off-by-one error in iov_iter_advance()</title>
<updated>2008-08-01T18:51:02+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2008-07-30T22:20:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=18b8506ad15f608f7c0e0745c55dd2cd5ff7c091'/>
<id>18b8506ad15f608f7c0e0745c55dd2cd5ff7c091</id>
<content type='text'>
commit 94ad374a0751f40d25e22e036c37f7263569d24c upstream

The iov_iter_advance() function would look at the iov-&gt;iov_len entry
even though it might have iterated over the whole array, and iov was
pointing past the end.  This would cause DEBUG_PAGEALLOC to trigger a
kernel page fault if the allocation was at the end of a page, and the
next page was unallocated.

The quick fix is to just change the order of the tests: check that there
is any iovec data left before we check the iov entry itself.

Thanks to Alexey Dobriyan for finding this case, and testing the fix.

Reported-and-tested-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 94ad374a0751f40d25e22e036c37f7263569d24c upstream

The iov_iter_advance() function would look at the iov-&gt;iov_len entry
even though it might have iterated over the whole array, and iov was
pointing past the end.  This would cause DEBUG_PAGEALLOC to trigger a
kernel page fault if the allocation was at the end of a page, and the
next page was unallocated.

The quick fix is to just change the order of the tests: check that there
is any iovec data left before we check the iov entry itself.

Thanks to Alexey Dobriyan for finding this case, and testing the fix.

Reported-and-tested-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Correct hash flushing from huge_ptep_set_wrprotect()</title>
<updated>2008-08-01T18:51:01+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2008-07-18T05:55:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=482780d80ba2ab6e6bcbb4ec2bf868d2f9bd4628'/>
<id>482780d80ba2ab6e6bcbb4ec2bf868d2f9bd4628</id>
<content type='text'>
Correct hash flushing from huge_ptep_set_wrprotect() [stable tree version]

A fix for incorrect flushing of the hash page table at fork() for
hugepages was recently committed as
86df86424939d316b1f6cfac1b6204f0c7dee317.  Without this fix, a process
can make a MAP_PRIVATE hugepage mapping, then fork() and have writes
to the mapping after the fork() pollute the child's version.

Unfortunately this bug also exists in the stable branch.  In fact in
that case copy_hugetlb_page_range() from mm/hugetlb.c calls
ptep_set_wrprotect() directly, the hugepage variant hook
huge_ptep_set_wrprotect() doesn't even exist.

The patch below is a port of the fix to the stable25/master branch.
It introduces a huge_ptep_set_wrprotect() call, but this is #defined
to be equal to ptep_set_wrprotect() unless the arch defines its own
version and sets __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT.

This arch preprocessor flag is kind of nasty, but it seems the sanest
way to introduce this fix with minimum risk of breaking other archs
for whom prep_set_wprotect() is suitable for hugepages.

Signed-off-by: Andy Whitcroft &lt;apw@shadowen.org&gt;
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Correct hash flushing from huge_ptep_set_wrprotect() [stable tree version]

A fix for incorrect flushing of the hash page table at fork() for
hugepages was recently committed as
86df86424939d316b1f6cfac1b6204f0c7dee317.  Without this fix, a process
can make a MAP_PRIVATE hugepage mapping, then fork() and have writes
to the mapping after the fork() pollute the child's version.

Unfortunately this bug also exists in the stable branch.  In fact in
that case copy_hugetlb_page_range() from mm/hugetlb.c calls
ptep_set_wrprotect() directly, the hugepage variant hook
huge_ptep_set_wrprotect() doesn't even exist.

The patch below is a port of the fix to the stable25/master branch.
It introduces a huge_ptep_set_wrprotect() call, but this is #defined
to be equal to ptep_set_wrprotect() unless the arch defines its own
version and sets __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT.

This arch preprocessor flag is kind of nasty, but it seems the sanest
way to introduce this fix with minimum risk of breaking other archs
for whom prep_set_wprotect() is suitable for hugepages.

Signed-off-by: Andy Whitcroft &lt;apw@shadowen.org&gt;
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>tmpfs: fix kernel BUG in shmem_delete_inode</title>
<updated>2008-08-01T18:50:53+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hugh@veritas.com</email>
</author>
<published>2008-07-29T02:50:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=007de77dcf9604bef3858183856d72ffa46fe0bf'/>
<id>007de77dcf9604bef3858183856d72ffa46fe0bf</id>
<content type='text'>
commit 14fcc23fdc78e9d32372553ccf21758a9bd56fa1 upstream

SuSE's insserve initscript ordering program hits kernel BUG at mm/shmem.c:814
on 2.6.26.  It's using posix_fadvise on directories, and the shmem_readpage
method added in 2.6.23 is letting POSIX_FADV_WILLNEED allocate useless pages
to a tmpfs directory, incrementing i_blocks count but never decrementing it.

Fix this by assigning shmem_aops (pointing to readpage and writepage and
set_page_dirty) only when it's needed, on a regular file or a long symlink.

Many thanks to Kel for outstanding bugreport and steps to reproduce it.

Reported-by: Kel Modderman &lt;kel@otaku42.de&gt;
Tested-by: Kel Modderman &lt;kel@otaku42.de&gt;
Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 14fcc23fdc78e9d32372553ccf21758a9bd56fa1 upstream

SuSE's insserve initscript ordering program hits kernel BUG at mm/shmem.c:814
on 2.6.26.  It's using posix_fadvise on directories, and the shmem_readpage
method added in 2.6.23 is letting POSIX_FADV_WILLNEED allocate useless pages
to a tmpfs directory, incrementing i_blocks count but never decrementing it.

Fix this by assigning shmem_aops (pointing to readpage and writepage and
set_page_dirty) only when it's needed, on a regular file or a long symlink.

Many thanks to Kel for outstanding bugreport and steps to reproduce it.

Reported-by: Kel Modderman &lt;kel@otaku42.de&gt;
Tested-by: Kel Modderman &lt;kel@otaku42.de&gt;
Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>slub: Fix use-after-preempt of per-CPU data structure</title>
<updated>2008-07-24T16:14:08+00:00</updated>
<author>
<name>Dmitry Adamushko</name>
<email>dmitry.adamushko@gmail.com</email>
</author>
<published>2008-07-11T01:20:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=46c88e2962809828a5105478f2a51d2ad10499d8'/>
<id>46c88e2962809828a5105478f2a51d2ad10499d8</id>
<content type='text'>
commit bdb21928512a860a60e6a24a849dc5b63cbaf96a upstream

Vegard Nossum reported a crash in kmem_cache_alloc():

	BUG: unable to handle kernel paging request at da87d000
	IP: [&lt;c01991c7&gt;] kmem_cache_alloc+0xc7/0xe0
	*pde = 28180163 *pte = 1a87d160
	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
	Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
	EIP: 0060:[&lt;c01991c7&gt;] EFLAGS: 00210203 CPU: 0
	EIP is at kmem_cache_alloc+0xc7/0xe0
	EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
	ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

and analyzed it:

  "The register %ecx looks innocent but is very important here. The disassembly:

       mov    %edx,%ecx
       shr    $0x2,%ecx
       rep stos %eax,%es:(%edi) &lt;-- the fault

   So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
   (0x6b6b6b6b &gt;&gt; 2 == 0x1adadada.)

   %ecx is the counter for the memset, from here:

       memset(object, 0, c-&gt;objsize);

  i.e. %ecx was loaded from c-&gt;objsize, so "c" must have been freed.
  Where did "c" come from? Uh-oh...

       c = get_cpu_slab(s, smp_processor_id());

  This looks like it has very much to do with CPU hotplug/unplug. Is
  there a race between SLUB/hotplug since the CPU slab is used after it
  has been freed?"

Good analysis.

Yeah, it's possible that a caller of kmem_cache_alloc() -&gt; slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset().  The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).

At some point of time the caller continues on another CPU having an
obsolete pointer...

Signed-off-by: Dmitry Adamushko &lt;dmitry.adamushko@gmail.com&gt;
Reported-by: Vegard Nossum &lt;vegard.nossum@gmail.com&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit bdb21928512a860a60e6a24a849dc5b63cbaf96a upstream

Vegard Nossum reported a crash in kmem_cache_alloc():

	BUG: unable to handle kernel paging request at da87d000
	IP: [&lt;c01991c7&gt;] kmem_cache_alloc+0xc7/0xe0
	*pde = 28180163 *pte = 1a87d160
	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
	Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
	EIP: 0060:[&lt;c01991c7&gt;] EFLAGS: 00210203 CPU: 0
	EIP is at kmem_cache_alloc+0xc7/0xe0
	EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
	ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

and analyzed it:

  "The register %ecx looks innocent but is very important here. The disassembly:

       mov    %edx,%ecx
       shr    $0x2,%ecx
       rep stos %eax,%es:(%edi) &lt;-- the fault

   So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
   (0x6b6b6b6b &gt;&gt; 2 == 0x1adadada.)

   %ecx is the counter for the memset, from here:

       memset(object, 0, c-&gt;objsize);

  i.e. %ecx was loaded from c-&gt;objsize, so "c" must have been freed.
  Where did "c" come from? Uh-oh...

       c = get_cpu_slab(s, smp_processor_id());

  This looks like it has very much to do with CPU hotplug/unplug. Is
  there a race between SLUB/hotplug since the CPU slab is used after it
  has been freed?"

Good analysis.

Yeah, it's possible that a caller of kmem_cache_alloc() -&gt; slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset().  The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).

At some point of time the caller continues on another CPU having an
obsolete pointer...

Signed-off-by: Dmitry Adamushko &lt;dmitry.adamushko@gmail.com&gt;
Reported-by: Vegard Nossum &lt;vegard.nossum@gmail.com&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Fix ZERO_PAGE breakage with vmware</title>
<updated>2008-06-24T21:08:31+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2008-06-23T18:21:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a3abcfa827d9bf769829d3ca0036a12a7a058a02'/>
<id>a3abcfa827d9bf769829d3ca0036a12a7a058a02</id>
<content type='text'>
commit 672ca28e300c17bf8d792a2a7a8631193e580c74 upstream

Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:

  "This broke vmware 6.0.4.
   Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
   /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.

The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.

This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.

Reported-and-tested-by: Jeff Chua &lt;jeff.chua.linux@gmail.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 672ca28e300c17bf8d792a2a7a8631193e580c74 upstream

Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:

  "This broke vmware 6.0.4.
   Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
   /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.

The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.

This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.

Reported-and-tested-by: Jeff Chua &lt;jeff.chua.linux@gmail.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Add return value to reserve_bootmem_node()</title>
<updated>2008-06-24T21:08:29+00:00</updated>
<author>
<name>Bernhard Walle</name>
<email>bwalle@suse.de</email>
</author>
<published>2008-06-21T17:01:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8c3a10753bfb7ba66b024398cc9bad9eea82a99f'/>
<id>8c3a10753bfb7ba66b024398cc9bad9eea82a99f</id>
<content type='text'>
commit 71c2742f5e6348d76ee62085cf0a13e5eff0f00e upstream

This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.

This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y

Signed-off-by: Bernhard Walle &lt;bwalle@suse.de&gt;
Reported-by: Adrian Bunk &lt;bunk@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 71c2742f5e6348d76ee62085cf0a13e5eff0f00e upstream

This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.

This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y

Signed-off-by: Bernhard Walle &lt;bwalle@suse.de&gt;
Reported-by: Adrian Bunk &lt;bunk@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP</title>
<updated>2008-06-24T21:08:29+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2008-06-20T18:18:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=489cf6b0ade32067774b89e8e822a012d3f3bb96'/>
<id>489cf6b0ade32067774b89e8e822a012d3f3bb96</id>
<content type='text'>
commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f upstream

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *".  The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
  that's not how that function works ]

Acked-by: Oleg Nesterov &lt;oleg@tv-sign.ru&gt;
Acked-by: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hugh@veritas.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f upstream

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *".  The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
  that's not how that function works ]

Acked-by: Oleg Nesterov &lt;oleg@tv-sign.ru&gt;
Acked-by: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hugh@veritas.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
</feed>
