linux-toradex.git/mm/memory.c, branch v2.6.25.2

memcg: when do_swap's do_wp_page fails

2008-03-05T00:35:14+00:00

Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
was charged for is there in the pagetable, and will be correctly uncharged
when that area is unmapped - it was only its COWing which failed.

And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
fails, mask out success bits, which might confuse some arches e.g.  sparc.

Signed-off-by: Hugh Dickins 
Cc: David Rientjes 
Acked-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hirokazu Takahashi 
Cc: YAMAMOTO Takashi 
Cc: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: page_cache_release not __free_page

2008-03-05T00:35:14+00:00

There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
do_anonymous page using __free_page, but it does look odd when nearby code
uses page_cache_release: use that instead (while turning a blind eye to
ancient inconsistencies of page_cache_release versus put_page).

Signed-off-by: Hugh Dickins 
Cc: David Rientjes 
Acked-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hirokazu Takahashi 
Cc: YAMAMOTO Takashi 
Cc: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86

2008-02-15T05:23:19+00:00

* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
  x86: cpa, fix out of date comment
  KVM is not seen under X86 config with latest git (32 bit compile)
  x86: cpa: ensure page alignment
  x86: include proper prototypes for rodata_test
  x86: fix gart_iommu_init()
  x86: EFI set_memory_x()/set_memory_uc() fixes
  x86: make dump_pagetable() static
  x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

d_path: Make d_path() use a struct path

2008-02-15T05:17:09+00:00

d_path() is used on a  pair.  Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck 
Acked-by: Bryan Wu 
Acked-by: Christoph Hellwig 
Cc: Al Viro 
Cc: "J. Bruce Fields" 
Cc: Neil Brown 
Cc: Michael Halcrow 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

2008-02-14T22:30:19+00:00

Jiri Kosina reported the following deadlock scenario with
show_unhandled_signals enabled:

 [   68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
 sp:7fff36f5d100 error:0<3>BUG: sleeping function called from invalid
 context at kernel/rwsem.c:21
 [   68.379039] in_atomic():1, irqs_disabled():0
 [   68.379044] no locks held by gnome-settings-/2941.
 [   68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
 [   68.379054]
 [   68.379056] Call Trace:
 [   68.379061]  <#DB>  [] ? __debug_show_held_locks+0x13/0x30
 [   68.379109]  [] __might_sleep+0xe5/0x110
 [   68.379123]  [] down_read+0x20/0x70
 [   68.379137]  [] print_vma_addr+0x3a/0x110
 [   68.379152]  [] do_trap+0xf5/0x170
 [   68.379168]  [] do_int3+0x7b/0xe0
 [   68.379180]  [] int3+0x9f/0xd0
 [   68.379203]  <>
 [   68.379229]  in libglib-2.0.so.0.1505.0[3d2c800000+dc000]

and tracked it down to:

  commit 03252919b79891063cf99145612360efbdf9500b
  Author: Andi Kleen 
  Date:   Wed Jan 30 13:33:18 2008 +0100

      x86: print which shared library/executable faulted in segfault etc. messages

the problem is that we call down_read() from an atomic context.

Solve this by returning from print_vma_addr() if the preempt count is
elevated. Update preempt_conditional_sti / preempt_conditional_cli to
unconditionally lift the preempt count even on !CONFIG_PREEMPT.

Reported-by: Jiri Kosina 
Signed-off-by: Ingo Molnar

Be more robust about bad arguments in get_user_pages()

2008-02-12T04:44:44+00:00

So I spent a while pounding my head against my monitor trying to figure
out the vmsplice() vulnerability - how could a failure to check for
*read* access turn into a root exploit? It turns out that it's a buffer
overflow problem which is made easy by the way get_user_pages() is
coded.

In particular, "len" is a signed int, and it is only checked at the
*end* of a do {} while() loop.  So, if it is passed in as zero, the loop
will execute once and decrement len to -1.  At that point, the loop will
proceed until the next invalid address is found; in the process, it will
likely overflow the pages array passed in to get_user_pages().

I think that, if get_user_pages() has been asked to grab zero pages,
that's what it should do.  Thus this patch; it is, among other things,
enough to block the (already fixed) root exploit and any others which
might be lurking in similar code.  I also think that the number of pages
should be unsigned, but changing the prototype of this function probably
requires some more careful review.

Signed-off-by: Jonathan Corbet 
Signed-off-by: Linus Torvalds

CONFIG_HIGHPTE vs. sub-page page tables.

2008-02-08T17:22:42+00:00

Background: I've implemented 1K/2K page tables for s390.  These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM.  The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste).  The pgstes are only required if the process is using the SIE
instruction.  The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since its not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch.  For everybody else it will be a (struct page *).  The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor.  The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
 To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.

Signed-off-by: Martin Schwidefsky 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Memory controller: make charging gfp mask aware

2008-02-07T16:42:19+00:00

Nick Piggin pointed out that swap cache and page cache addition routines
could be called from non GFP_KERNEL contexts.  This patch makes the
charging routine aware of the gfp context.  Charging might fail if the
cgroup is over it's limit, in which case a suitable error is returned.

This patch was tested on a Powerpc box.  I am still looking at being able
to test the path, through which allocations happen in non GFP_KERNEL
contexts.

[kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
Signed-off-by: Balbir Singh 
Cc: Pavel Emelianov 
Cc: Paul Menage 
Cc: Peter Zijlstra 
Cc: "Eric W. Biederman" 
Cc: Nick Piggin 
Cc: Kirill Korotaev 
Cc: Herbert Poetzl 
Cc: David Rientjes 
Cc: Vaidyanathan Srinivasan 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Memory controller: memory accounting

2008-02-07T16:42:18+00:00

Add the accounting hooks.  The accounting is carried out for RSS and Page
Cache (unmapped) pages.  There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time.  Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache().  Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier.  A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup.  Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Balbir Singh 
Cc: Pavel Emelianov 
Cc: Paul Menage 
Cc: Peter Zijlstra 
Cc: "Eric W. Biederman" 
Cc: Nick Piggin 
Cc: Kirill Korotaev 
Cc: Herbert Poetzl 
Cc: David Rientjes 
Cc: 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

brk randomization: introduce CONFIG_COMPAT_BRK

2008-02-06T21:39:44+00:00

based on similar patch from: Pavel Machek 

Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free
(but not obliged to) randomize the brk area.

Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
enabled by default.

Signed-off-by: Ingo Molnar