linux-toradex.git/include/linux, branch v3.13.9

cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-04-03T19:02:38+00:00

commit 0ab02ca8f887908152d1a96db5130fc661d36a1e upstream.

Setup cgroupfs like this:
  # mount -t cgroup -o cpuacct xxx /cgroup
  # mkdir /cgroup/sub1
  # mkdir /cgroup/sub2

Then run these two commands:
  # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } &
  # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } &

After seconds you may see this warning:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
 [] dump_stack+0x7a/0x96
 [] warn_slowpath_common+0x8c/0xc0
 [] warn_slowpath_fmt+0x46/0x50
 [] sub_remove+0x87/0x1b0
 [] ? css_killed_work_fn+0x32/0x1b0
 [] idr_remove+0x25/0xd0
 [] cgroup_destroy_css_killed+0x5b/0xc0
 [] css_killed_work_fn+0x130/0x1b0
 [] process_one_work+0x26c/0x550
 [] worker_thread+0x12e/0x3b0
 [] kthread+0xe6/0xf0
 [] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---

It's because allocating/removing cgroup ID is not properly synchronized.

The bug was introduced when we converted cgroup_ida to cgroup_idr.
While synchronization is already done inside ida_simple_{get,remove}(),
users are responsible for concurrent calls to idr_{alloc,remove}().

tj: Refreshed on top of b58c89986a77 ("cgroup: fix error return from
cgroup_create()").

[mhocko@suse.cz: ported to 3.12]
Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr")
Cc:  #3.12+
Reported-by: Michal Hocko 
Signed-off-by: Li Zefan 
Signed-off-by: Michal Hocko 
Signed-off-by: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

mm: close PageTail race

2014-04-03T19:02:37+00:00

commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream.

Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes 
Cc: Holger Kiehl 
Cc: Christoph Lameter 
Cc: Rafael Aquini 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Rik van Riel 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ext4: atomically set inode->i_flags in ext4_set_inode_flags()

2014-04-03T19:02:34+00:00

commit 00a1a053ebe5febcfc2ec498bd894f035ad2aa06 upstream.

Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

libceph: rename ceph_msg::front_max to front_alloc_len

2014-03-31T17:05:16+00:00

commit 3cea4c3071d4e55e9d7356efe9d0ebf92f0c2204 upstream.

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Greg Kroah-Hartman

libceph: block I/O when PAUSE or FULL osd map flags are set

2014-03-31T17:05:14+00:00

commit d29adb34a94715174c88ca93e8aba955850c9bde upstream.

The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Reviewed-by: Sage Weil 
Signed-off-by: Josh Durgin 
Signed-off-by: Greg Kroah-Hartman

tracing: Fix array size mismatch in format string

2014-03-31T17:05:14+00:00

commit 87291347c49dc40aa339f587b209618201c2e527 upstream.

In event format strings, the array size is reported in two locations.
One in array subscript and then via the "size:" attribute. The values
reported there have a mismatch.

For e.g., in sched:sched_switch the prev_comm and next_comm character
arrays have subscript values as [32] where as the actual field size is
16.

name: sched_switch
ID: 301
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char prev_comm[32];       offset:8;       size:16;        signed:1;
        field:pid_t prev_pid;   offset:24;      size:4; signed:1;
        field:int prev_prio;    offset:28;      size:4; signed:1;
        field:long prev_state;  offset:32;      size:8; signed:1;
        field:char next_comm[32];       offset:40;      size:16;        signed:1;
        field:pid_t next_pid;   offset:56;      size:4; signed:1;
        field:int next_prio;    offset:60;      size:4; signed:1;

After bisection, the following commit was blamed:
92edca0 tracing: Use direct field, type and system names

This commit removes the duplication of strings for field->name and
field->type assuming that all the strings passed in
__trace_define_field() are immutable. This is not true for arrays, where
the type string is created in event_storage variable and field->type for
all array fields points to event_storage.

Use __stringify() to create a string constant for the type string.

Also, get rid of event_storage and event_storage_mutex that are not
needed anymore.

also, an added benefit is that this reduces the overhead of events a bit more:

   text    data     bss     dec     hex filename
8424787 2036472 1302528 11763787         b3804b vmlinux
8420814 2036408 1302528 11759750         b37086 vmlinux.patched

Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com

Cc: Laurent Chavey 
Signed-off-by: Vaibhav Nagarnaik 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

NFSv4: Fix another nfs4_sequence corruptor

2014-03-24T04:44:11+00:00

commit b7e63a1079b266866a732cf699d8c4d61391bbda upstream.

nfs4_release_lockowner needs to set the rpc_message reply to point to
the nfs4_sequence_res in order to avoid another Oopsable situation
in nfs41_assign_slot.

Fixes: fbd4bfd1d9d21 (NFS: Add nfs4_sequence calls for RELEASE_LOCKOWNER)
Signed-off-by: Trond Myklebust 
Signed-off-by: Greg Kroah-Hartman

firewire: don't use PREPARE_DELAYED_WORK

2014-03-24T04:44:10+00:00

commit 70044d71d31d6973665ced5be04ef39ac1c09a48 upstream.

PREPARE_[DELAYED_]WORK() are being phased out.  They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.

firewire core-device and sbp2 have been been multiplexing work items
with multiple work functions.  Introduce fw_device_workfn() and
sbp2_lu_workfn() which invoke fw_device->workfn and
sbp2_logical_unit->workfn respectively and always use the two
functions as the work functions and update the users to set the
->workfn fields instead of overriding work functions using
PREPARE_DELAYED_WORK().

This fixes a variety of possible regressions since a2c1c57be8d9
"workqueue: consider work function when searching for busy work items"
due to which fw_workqueue lost its required non-reentrancy property.

Signed-off-by: Tejun Heo 
Acked-by: Stefan Richter 
Cc: linux1394-devel@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman

tracing: Do not add event files for modules that fail tracepoints

2014-03-24T04:44:10+00:00

commit 45ab2813d40d88fc575e753c38478de242d03f88 upstream.

If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.

Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.

Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org

Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers 
Cc: Rusty Russell 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking

2014-03-24T04:43:58+00:00

commit 9050d7eba40b3d79551668f54e68fd6f51945ef3 upstream.

Daniel Borkmann reported a VM_BUG_ON assertion failing:

  ------------[ cut here ]------------
  kernel BUG at mm/mlock.c:528!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ccm arc4 iwldvm [...]
   video
  CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
  Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
  task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
  RIP: 0010:[]  [] munlock_vma_pages_range+0x2e0/0x2f0
  Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
  RIP   munlock_vma_pages_range+0x2e0/0x2f0
  ---[ end trace a0088dcf07ae10f2 ]---

because munlock_vma_pages_range() thinks it's unexpectedly in the middle
of a THP page.  This can be reproduced with default config since 3.11
kernels.  A reproducer can be found in the kernel's selftest directory
for networking by running ./psock_tpacket.

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped
by packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
not trigger a bug.  It just makes munlock_vma_pages_range() skip such
compound pages until the next 512-pages-aligned page, when it encounters
a head page.  This is however not a problem for vma's where mlocking has
no effect anyway, but it can distort the accounting.

Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
PageTransHuge() check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
list of flags that make vma's non-mlockable and non-mergeable.  The
reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
already on the VM_SPECIAL list, and both are intended for non-LRU pages
where mlocking makes no sense anyway.  Related Lkml discussion can be
found in [2].

 [1] tools/testing/selftests/net/psock_tpacket
 [2] https://lkml.org/lkml/2014/1/10/427

Signed-off-by: Vlastimil Babka 
Signed-off-by: Daniel Borkmann 
Reported-by: Daniel Borkmann 
Tested-by: Daniel Borkmann 
Cc: Thomas Hellstrom 
Cc: John David Anglin 
Cc: HATAYAMA Daisuke 
Cc: Konstantin Khlebnikov 
Cc: Carsten Otte 
Cc: Jared Hulbert 
Tested-by: Hannes Frederic Sowa 
Cc: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman