linux-toradex.git/include, branch v3.3.6

hugepages: fix use after free bug in "quota" handling

2012-05-12T16:32:21+00:00

commit 90481622d75715bfcb68501280a917dbfe516029 upstream.

hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.

Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed.  If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.

Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, store pointers directly to the superblock, bumping the reference
count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.

This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation.  It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from.  hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now.  The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.

subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
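
For illustration only, the subpool lifetime rule described above can be
modelled in user space roughly as follows.  This is a sketch under stated
assumptions, not the kernel implementation: the struct, the function names
and the "freed" test hook are hypothetical, and the locking that real code
would need is left out.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model of a hugepage subpool (not kernel code). */
struct subpool {
    long max_pages;   /* page limit for this filesystem instance */
    long used_pages;  /* pages currently charged to the pool */
    long refs;        /* non-page references, e.g. a mounted superblock */
    bool *freed;      /* demo hook: set to true when the pool is released */
};

static struct subpool *subpool_new(long max_pages, bool *freed_flag)
{
    struct subpool *sp = malloc(sizeof(*sp));

    if (!sp)
        return NULL;
    sp->max_pages = max_pages;
    sp->used_pages = 0;
    sp->refs = 1;             /* the creating superblock holds one reference */
    sp->freed = freed_flag;
    return sp;
}

/* Release the pool only once no pages and no references remain. */
static void subpool_maybe_free(struct subpool *sp)
{
    if (sp->refs == 0 && sp->used_pages == 0) {
        if (sp->freed)
            *sp->freed = true;
        free(sp);
    }
}

/* Drop a superblock reference, e.g. at unmount. */
static void subpool_put(struct subpool *sp)
{
    sp->refs--;
    subpool_maybe_free(sp);
}

/* Charge pages to the pool; a NULL pool means no limit is in effect. */
static int subpool_get_pages(struct subpool *sp, long delta)
{
    if (!sp)
        return 0;
    if (sp->used_pages + delta > sp->max_pages)
        return -1;            /* over the limit */
    sp->used_pages += delta;
    return 0;
}

/* Uncharge pages; this may be the last thing keeping the pool alive. */
static void subpool_put_pages(struct subpool *sp, long delta)
{
    if (!sp)
        return;
    sp->used_pages -= delta;
    subpool_maybe_free(sp);
}
```

In this model a page freed after unmount still finds a valid pool, because
the pool outlives the superblock until the last page is uncharged.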

Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1

v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.

Signed-off-by: Andrew Barry 
Signed-off-by: David Gibson 
Cc: Hugh Dickins 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: Hillf Danton 
Cc: Paul Mackerras 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

KVM: Ensure all vcpus are consistent with in-kernel irqchip settings

2012-05-12T16:32:20+00:00

(cherry picked from commit 3e515705a1f46beb1c942bb8043c16f8ac7b1e9e)

If some vcpus are created before KVM_CREATE_IRQCHIP, then
irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
to potential NULL pointer dereferences.

Fix by:
- ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
- ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP

This is somewhat long winded because vcpu->arch.apic is created without
kvm->lock held.
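
As a rough sketch of the two checks listed above (a user-space model only;
the struct names, function names and error codes are assumptions made for
illustration, and the kvm->lock handling is omitted):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the VM and vcpu state relevant to the fix. */
struct vm_model {
    bool irqchip_in_kernel;   /* set once the in-kernel irqchip exists */
    int  online_vcpus;        /* number of vcpus already installed */
};

struct vcpu_model {
    bool has_apic;            /* stands in for vcpu->arch.apic != NULL */
};

/* Refuse to create the irqchip once any vcpu has been installed. */
static int vm_create_irqchip(struct vm_model *vm)
{
    if (vm->irqchip_in_kernel)
        return -EEXIST;       /* assumed error code */
    if (vm->online_vcpus > 0)
        return -EINVAL;       /* assumed error code */
    vm->irqchip_in_kernel = true;
    return 0;
}

/* Refuse to install a vcpu whose apic state disagrees with the VM. */
static int vm_install_vcpu(struct vm_model *vm, const struct vcpu_model *vcpu)
{
    if (vcpu->has_apic != vm->irqchip_in_kernel)
        return -EINVAL;       /* assumed error code */
    vm->online_vcpus++;
    return 0;
}
```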

Based on earlier patch by Michael Ellerman.

Signed-off-by: Michael Ellerman 
Signed-off-by: Avi Kivity 
Signed-off-by: Greg Kroah-Hartman

net: Fix issue with netdev_tx_reset_queue not resetting queue from XOFF state

2012-05-12T16:32:19+00:00

[ Upstream commit 5c4903549c05bbb373479e0ce2992573c120654a ]

We are seeing dev_watchdog hangs on several drivers.  I suspect this is due
to the __QUEUE_STATE_STACK_XOFF bit being set prior to a reset for link
change, and then not being cleared by netdev_tx_reset_queue.  This change
corrects that.

In addition we were seeing dev_watchdog hangs on igb after running the
ethtool tests.  We found this to be due to the fact that the ethtool test
runs the same logic as ndo_start_xmit, but we were never clearing the XOFF
flag since the loopback test in ethtool does not do byte queue accounting.
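
The failure mode can be sketched with a small user-space model.  The names
below (txq_model, DEMO_STACK_XOFF and the helpers) are hypothetical
stand-ins, not the kernel structures; the point is only that a reset has to
clear the XOFF bit along with the byte counters.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the stack XOFF state bit. */
#define DEMO_STACK_XOFF (1UL << 1)

struct txq_model {
    unsigned long state;      /* bit flags, including the stack XOFF bit */
    unsigned int  in_flight;  /* bytes queued but not yet completed */
    unsigned int  limit;      /* byte queue limit */
};

/* Account for queued bytes; stop the queue once the limit is exceeded. */
static void txq_sent(struct txq_model *q, unsigned int bytes)
{
    q->in_flight += bytes;
    if (q->in_flight > q->limit)
        q->state |= DEMO_STACK_XOFF;
}

/* Reset, e.g. after a link change: clear the counters AND the XOFF bit.
 * Clearing only the counters would leave the queue stopped forever. */
static void txq_reset(struct txq_model *q)
{
    q->state &= ~DEMO_STACK_XOFF;
    q->in_flight = 0;
}

static bool txq_stopped(const struct txq_model *q)
{
    return (q->state & DEMO_STACK_XOFF) != 0;
}
```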

Signed-off-by: Alexander Duyck 
Tested-by: Stephen Ko 
Signed-off-by: Jeff Kirsher 
Signed-off-by: Greg Kroah-Hartman

net: Add memory barriers to prevent possible race in byte queue limits

2012-05-12T16:32:19+00:00

[ Upstream commit b37c0fbe3f6dfba1f8ad2aed47fb40578a254635 ]

This change adds a memory barrier to the byte queue limit code to address a
possible race as has been seen in the past with the
netif_stop_queue/netif_wake_queue logic.
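
A minimal sketch of the stop/recheck pattern such a barrier protects,
written with C11 atomics in user space.  The struct and function names are
hypothetical, and atomic_thread_fence() merely stands in for the kernel
barrier; this is not the byte queue limit code itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct bql_model {
    atomic_int  avail;    /* byte budget left; negative means over limit */
    atomic_bool stopped;  /* stands in for the stack XOFF bit */
};

/* Transmit side: charge bytes, stop the queue if the budget is gone. */
static void bql_sent(struct bql_model *q, int bytes)
{
    atomic_fetch_sub_explicit(&q->avail, bytes, memory_order_relaxed);
    if (atomic_load_explicit(&q->avail, memory_order_relaxed) >= 0)
        return;

    atomic_store_explicit(&q->stopped, true, memory_order_relaxed);

    /* Order the store to "stopped" before the re-read of "avail". */
    atomic_thread_fence(memory_order_seq_cst);

    /* Check again in case the completion side just made room. */
    if (atomic_load_explicit(&q->avail, memory_order_relaxed) >= 0)
        atomic_store_explicit(&q->stopped, false, memory_order_relaxed);
}

/* Completion side: returns true if the queue has to be woken up. */
static bool bql_completed(struct bql_model *q, int bytes)
{
    atomic_fetch_add_explicit(&q->avail, bytes, memory_order_relaxed);

    /* Order the update of "avail" before the read of "stopped". */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&q->avail, memory_order_relaxed) < 0)
        return false;
    return atomic_exchange_explicit(&q->stopped, false, memory_order_relaxed);
}
```

Without the two fences, each side could miss the other's update: the
transmit side stops the queue while the completion side has already decided
that nobody needs waking.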

Signed-off-by: Alexander Duyck 
Tested-by: Stephen Ko 
Signed-off-by: Jeff Kirsher 
Signed-off-by: Greg Kroah-Hartman

Fix __read_seqcount_begin() to use ACCESS_ONCE for sequence value read

2012-05-12T16:32:05+00:00

commit 2f624278626677bfaf73fef97f86b37981621f5c upstream.

We really need to use an ACCESS_ONCE() on the sequence value read in
__read_seqcount_begin(), because otherwise the compiler might end up
reloading the value in between the test and the return of it.  As a
result, it might end up returning an odd value (which means that a write
is in progress).

If the reader is then fast enough that that odd value is still the
current one when the read_seqcount_retry() is done, we might end up with
a "successful" read sequence, even despite the concurrent write being
active.

In practice this probably never really happens - there just isn't
anything else going on around the read of the sequence count, and the
common case is that we end up having a read barrier immediately
afterwards.

So the code sequence in which gcc might decide to reload from memory is
small, and there's no reason to believe it would ever actually do the
reload.  But if the compiler ever were to decide to do so, it would be
incredibly annoying to debug.  Let's just make sure.

Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

asm-generic: Use __BITS_PER_LONG in statfs.h

2012-05-12T16:32:05+00:00

commit f5c2347ee20a8d6964d6a6b1ad04f200f8d4dfa7 upstream.

<asm-generic/statfs.h> is exported to userspace, so using
BITS_PER_LONG is invalid.  We need to use __BITS_PER_LONG instead.
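
A small sketch of why the distinction matters, assuming a Linux user-space
build where the exported <asm/bitsperlong.h> header is available.  The
typedef and helper names are hypothetical; only __BITS_PER_LONG comes from
the exported header.

```c
#include <asm/bitsperlong.h>  /* exported header: defines __BITS_PER_LONG */
#include <limits.h>

/* Pick a word type the way an exported header has to: by testing the
 * double-underscore macro, since plain BITS_PER_LONG is kernel-internal
 * and not defined for user-space builds. */
#if __BITS_PER_LONG == 64
typedef long demo_statfs_word;
#else
typedef unsigned int demo_statfs_word;
#endif

/* Returns 1 if the macro agrees with the compiler's idea of long. */
static int demo_bits_per_long_ok(void)
{
    return __BITS_PER_LONG == (int)(sizeof(long) * CHAR_BIT);
}
```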

This is kernel bugzilla 43165.

Reported-by: H.J. Lu 
Signed-off-by: H. Peter Anvin 
Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.com
Acked-by: Arnd Bergmann 
Signed-off-by: Greg Kroah-Hartman

efi: Add new variable attributes

2012-05-07T15:53:34+00:00

commit 41b3254c93acc56adc3c4477fef7c9512d47659e upstream.

More recent versions of the UEFI spec have added new attributes for
variables. Add them.

Signed-off-by: Matthew Garrett 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

pipes: add a "packetized pipe" mode for writing

2012-05-07T15:53:23+00:00

commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.

The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
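
A user-space sketch of the behaviour described above, assuming a Linux
system with glibc.  The helper name and its return codes are hypothetical;
pipe2(), O_DIRECT and PIPE_BUF are the real interfaces.

```c
#define _GNU_SOURCE           /* for pipe2() and O_DIRECT */
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/* Write two packets into a packetized pipe and read them back.
 * Stores the size of each read in sizes[].  Returns 0 on success,
 * -1 if the pipe could not be created (e.g. EINVAL on kernels
 * without packet mode), -2 on a short or failed write. */
static int packet_pipe_demo(ssize_t sizes[2])
{
    int fd[2];
    char buf[PIPE_BUF];       /* big enough for any atomic packet */

    if (pipe2(fd, O_DIRECT) < 0)
        return -1;

    if (write(fd[1], "hello", 5) != 5 || write(fd[1], "hi", 2) != 2) {
        close(fd[0]);
        close(fd[1]);
        return -2;
    }

    /* Each read() stops at a packet boundary even though the buffer
     * could hold both writes. */
    sizes[0] = read(fd[0], buf, sizeof(buf));
    sizes[1] = read(fd[0], buf, sizeof(buf));

    close(fd[0]);
    close(fd[1]);
    return 0;
}
```

On a kernel with packet mode, the two reads are expected to return 5 and 2
bytes; on an ordinary pipe the first read would return all 7.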

Tested-by: Michael Tokarev 
Cc: Alan Cox 
Cc: David Miller 
Cc: Ian Kent 
Cc: Thomas Meyer 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

USB: EHCI: fix crash during suspend on ASUS computers

2012-05-07T15:53:23+00:00

commit 151b61284776be2d6f02d48c23c3625678960b97 upstream.

This patch (as1545) fixes a problem affecting several ASUS computers:
The machine crashes or corrupts memory when going into suspend if the
ehci-hcd driver is bound to any controllers.  Users have been forced
to unbind or unload ehci-hcd before putting their systems to sleep.

After extensive testing, it was determined that the machines don't
like going into suspend when any EHCI controllers are in the PCI D3
power state.  Presumably this is a firmware bug, but there's nothing
we can do about it except to avoid putting the controllers in D3
during system sleep.

The patch adds a new flag to indicate whether the problem is present,
and avoids changing the controller's power state if the flag is set.
Runtime suspend is unaffected; this matters only for system suspend.
However as a side effect, the controller will not respond to remote
wakeup requests while the system is asleep.  Hence USB wakeup is not
functional -- but of course, this is already true in the current state
of affairs.

This fixes Bugzilla #42728.

Signed-off-by: Alan Stern 
Tested-by: Steven Rostedt 
Tested-by: Andrey Rahmatullin 
Tested-by: Oleksij Rempel (fishor) 
Signed-off-by: Greg Kroah-Hartman

tcp: avoid order-1 allocations on wifi and tx path

2012-04-27T17:17:06+00:00

[ This combines upstream commit
  a21d45726acacc963d8baddf74607d9b74e2b723 and the follow-on bug fix
  commit a21d45726acacc963d8baddf74607d9b74e2b723 ]

Marc Merlin reported many order-1 allocation failures in the TX path on
his wireless setup, which don't make any sense with an MTU=1500 network
and non-SG-capable hardware.

After investigation, it turns out TCP uses sk_stream_alloc_skb() and
used, as a convention, skb_tailroom(skb) to know how many bytes of data
payload could be put in this skb (for non-SG-capable devices).

Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER +
sizeof(struct skb_shared_info) being above 2048)

Later, the mac80211 layer needs to add some bytes at the tail of the skb
(IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is
available, has to call pskb_expand_head() and request order-1
allocations.

This patch changes sk_stream_alloc_skb() so that only
sk->sk_prot->max_header bytes of headroom are reserved, and use a new
skb field, avail_size to hold the data payload limit.

This way, order-0 allocations done by TCP stack can leave more than 2 KB
of tailroom and no more allocation is performed in mac80211 layer (or
any layer needing some tailroom)

avail_size is unioned with mark/dropcount, since mark will be set later
in IP stack for output packets. Therefore, skb size is unchanged.
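
The "size is unchanged" argument can be checked with a toy model.  The
structs below are hypothetical cut-down stand-ins for struct sk_buff (only
a length field and the union are kept), and demo_availroom() is an assumed
helper showing how a payload limit stored in avail_size would be used.

```c
#include <stdint.h>

/* Layout before the change: mark and dropcount share storage. */
struct skb_before {
    uint32_t len;
    union {
        uint32_t mark;
        uint32_t dropcount;
    };
};

/* Layout after the change: avail_size joins the same union. */
struct skb_after {
    uint32_t len;
    union {
        uint32_t mark;
        uint32_t dropcount;
        uint32_t avail_size;  /* data payload limit for this skb */
    };
};

/* Room left for payload: the limit minus what is already stored.
 * Assumes a linear skb model with len <= avail_size. */
static uint32_t demo_availroom(const struct skb_after *skb)
{
    return skb->avail_size - skb->len;
}
```

Because avail_size only matters while TCP is filling the skb and mark is
set later on the output path, the two never need to hold a value at the
same time.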

Reported-by: Marc MERLIN 
Tested-by: Marc MERLIN 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman