linux-toradex.git/include/linux/pagemap.h, branch v2.6.27.14

fs: symlink write_begin allocation context fix

2009-01-18T18:35:43+00:00

commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream.

With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened.  They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim.  This bug could
cause filesystem deadlocks.

The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock.  The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
this flag in their write_begin function.  Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
random example).

[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin 
Reviewed-by: KOSAKI Motohiro 
Cc: 
Signed-off-by: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
[ Cleaned up the calling convention: just pass in the AOP flags
  untouched to the grab_cache_page_write_begin() function.  That
  just simplifies everybody, and may even allow future expansion of the
  logic.   - Linus ]
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: rename page trylock

2008-08-05T04:31:34+00:00

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin 
Acked-by: KOSAKI Motohiro 
Acked-by: Peter Zijlstra 
Acked-by: Andrew Morton 
Acked-by: Benjamin Herrenschmidt 
Signed-off-by: Linus Torvalds

powerpc/mm: Lockless get_user_pages_fast() for 64-bit (v3)

2008-07-30T05:26:54+00:00

Implement lockless get_user_pages_fast for 64-bit powerpc.

Page table existence is guaranteed with RCU, and speculative page references
are used to take a reference to the pages without having a prior existence
guarantee on them.

Signed-off-by: Nick Piggin 
Signed-off-by: Dave Kleikamp 
Signed-off-by: Andrew Morton 
Signed-off-by: Benjamin Herrenschmidt

mmu-notifiers: add mm_take_all_locks() operation

2008-07-28T23:30:21+00:00

mm_take_all_locks holds off reclaim from an entire mm_struct.  This allows
mmu notifiers to register into the mm at any time with the guarantee that
no mmu operation is in progress on the mm.

This operation locks against the VM for all pte/vma/mm related operations
that could ever happen on a certain mm.  This includes vmtruncate,
try_to_unmap, and all page faults.

The caller must take the mmap_sem in write mode before calling
mm_take_all_locks().  The caller isn't allowed to release the mmap_sem
until mm_drop_all_locks() returns.

mmap_sem in write mode is required in order to block all operations that
could modify pagetables and free pages without need of altering the vma
layout (for example populate_range() with nonlinear vmas).  It's also
needed in write mode to avoid new anon_vmas to be associated with existing
vmas.

A single task can't take more than one mm_take_all_locks() in a row or it
would deadlock.

mm_take_all_locks() and mm_drop_all_locks are expensive operations that
may have to take thousand of locks.

mm_take_all_locks() can fail if it's interrupted by signals.

When mmu_notifier_register returns, we must be sure that the driver is
notified if some task is in the middle of a vmtruncate for the 'mm' where
the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
is run around the vmtruncation but mmu_notifier_register can run after
mmu_notifier_invalidate_range_start and before
mmu_notifier_invalidate_range_end).  Same problem for rmap paths.  And
we've to remove page pinning to avoid replicating the tlb_gather logic
inside KVM (and GRU doesn't work well with page pinning regardless of
needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
the page, kvm would have no way to notice that it mapped into sptes a page
that is going into the freelist without a chance of any further
mmu_notifier notification.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrea Arcangeli 
Acked-by: Linus Torvalds 
Cc: Christoph Lameter 
Cc: Jack Steiner 
Cc: Robin Holt 
Cc: Nick Piggin 
Cc: Peter Zijlstra 
Cc: Kanoj Sarcar 
Cc: Roland Dreier 
Cc: Steve Wise 
Cc: Avi Kivity 
Cc: Hugh Dickins 
Cc: Rusty Russell 
Cc: Anthony Liguori 
Cc: Chris Wright 
Cc: Marcelo Tosatti 
Cc: Eric Dumazet 
Cc: "Paul E. McKenney" 
Cc: Izik Eidus 
Cc: Anthony Liguori 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: speculative page references

2008-07-26T19:00:06+00:00

If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.

This can be done if get_page/put_page behaves consistently throughout the
whole tree (ie.  if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).

Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page.  We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.

This patch introduces the core locking protocol to the pagecache (ie.
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).

Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.

[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin 
Cc: Jeff Garzik 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Hugh Dickins 
Cc: "Paul E. McKenney" 
Reviewed-by: Peter Zijlstra 
Signed-off-by: Daisuke Nishimura 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: KOSAKI Motohiro 
Signed-off-by: Hugh Dickins 
Acked-by: Nick Piggin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mapping_set_error: add unlikely()

2008-07-24T17:47:15+00:00

This is called on a per-page basis and in the vast majority of cases
`error' is zero.

Cc: Guillaume Chazarain 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

include/linux: Remove all users of FASTCALL() macro

2008-02-14T00:21:18+00:00

FASTCALL() is always expanded to empty, remove it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Add lock_page_killable

2007-12-06T22:35:41+00:00

This routine is like lock_page, but can be interrupted by a fatal signal

Signed-off-by: Matthew Wilcox

fs: introduce write_begin, write_end, and perform_write aops

2007-10-16T16:42:55+00:00

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin 
Signed-off-by: Mark Fasheh 
Signed-off-by: Dmitriy Monakhov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix pagecache write deadlocks

2007-10-16T16:42:54+00:00

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.

Signed-off-by: Nick Piggin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds