linux-toradex.git/include, branch v4.4.5

modules: fix longstanding /proc/kallsyms vs module insertion race.

2016-03-09T23:34:56+00:00

commit 8244062ef1e54502ef55f54cced659913f244c3e upstream.

For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
There's one full copy, marked SHF_ALLOC and laid out at the end of the
module's init section.  There's also a cut-down version that only
contains core symbols and strings, and lives in the module's core
section.

After module init (and before we free the module memory), we switch
the mod->symtab, mod->num_symtab and mod->strtab to point to the core
versions.  We do this under the module_mutex.

However, kallsyms doesn't take the module_mutex: it uses
preempt_disable() and rcu tricks to walk through the modules, because
it's used in the oops path.  It's also used in /proc/kallsyms.
There's nothing atomic about the change of these variables, so we can
get the old (larger!) num_symtab and the new symtab pointer; in fact
this is what I saw when trying to reproduce.

By grouping these variables together, we can use a
carefully-dereferenced pointer to ensure we always get one or the
other (the free of the module init section is already done in an RCU
callback, so that's safe).  We allocate the init one at the end of the
module init section, and keep the core one inside the struct module
itself (it could also have been allocated at the end of the module
core, but that's probably overkill).

[ Rebased for 4.4-stable and older, because the following changes aren't
  in the older trees:
  - e0224418516b4d8a6c2160574bac18447c354ef0: adds arg to is_core_symbol
  - 7523e4dc5057e157212b4741abd6256e03404cf1: module_init/module_core/init_size/core_size
    become init_layout.base/core_layout.base/init_layout.size/core_layout.size.
]

Reported-by: Weilong Chen 
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
Cc: stable@kernel.org
Signed-off-by: Rusty Russell 
Signed-off-by: Greg Kroah-Hartman

block: get the 1st and last bvec via helpers

2016-03-09T23:34:56+00:00

commit 25e71a99f10e444cd00bb2ebccb11e1c9fb672b1 upstream.

This patch applies the two introduced helpers to
figure out the 1st and last bvec, and fixes the
original way after bio splitting.

Reported-by: Sagi Grimberg 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: check virt boundary in bio_will_gap()

2016-03-09T23:34:56+00:00

commit e0af29171aa8912e1ca95023b75ef336cd70d661 upstream.

In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.

Reviewed-by: Sagi Grimberg 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

tracing: Do not have 'comm' filter override event 'comm' field

2016-03-09T23:34:52+00:00

commit e57cbaf0eb006eaa207395f3bfd7ce52c1b5539c upstream.

Commit 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.

 echo 'comm == "bash"' > events/sched_migrate_task/filter

will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.

This fix requires a couple of changes.

1) Change the look up order for filter predicates to look at the events
   fields before looking at the generic filters.

2) Instead of basing the filter function off of the "comm" name, have the
   generic "comm" filter have its own filter_type (FILTER_COMM). Test
   against the type instead of the name to assign the filter function.

3) Add a new "COMM" filter that works just like "comm" but will filter based
   on the current task, even if the trace event contains a "comm" field.

Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".

Fixes: 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

writeback: flush inode cgroup wb switches instead of pinning super_block

2016-03-09T23:34:52+00:00

commit a1a0e23e49037c23ea84bc8cc146a03584d13577 upstream.

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo 
Reported-by: Tahsin Erdogan 
Cc: Jan Kara 
Cc: Al Viro 
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Reviewed-by: Jan Kara 
Tested-by: Tahsin Erdogan 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: bio: introduce helpers to get the 1st and last bvec

2016-03-09T23:34:52+00:00

commit 7bcd79ac50d9d83350a835bdb91c04ac9e098412 upstream.

The bio passed to bio_will_gap() may be fast cloned from upper
layer(dm, md, bcache, fs, ...), or from bio splitting in block
core.

Unfortunately bio_will_gap() just figures out the last bvec via
'bi_io_vec[prev->bi_vcnt - 1]' directly, and this way is obviously
wrong.

This patch introduces two helpers for getting the first and last
bvec of one bio for fixing the issue.

Reported-by: Sagi Grimberg 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

libata: Align ata_device's id on a cacheline

2016-03-09T23:34:52+00:00

commit 4ee34ea3a12396f35b26d90a094c75db95080baa upstream.

The id buffer in ata_device is a DMA target, but it isn't explicitly
cacheline aligned. Due to this, adjacent fields can be overwritten with
stale data from memory on non coherent architectures. As a result, the
kernel is sometimes unable to communicate with an ATA device.

Fix this by ensuring that the id buffer is cacheline aligned.

This issue is similar to that fixed by Commit 84bda12af31f
("libata: align ap->sector_buf").

Signed-off-by: Harvey Hunt 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

libata: fix HDIO_GET_32BIT ioctl

2016-03-09T23:34:52+00:00

commit 287e6611ab1eac76c2c5ebf6e345e04c80ca9c61 upstream.

As reported by Soohoon Lee, the HDIO_GET_32BIT ioctl does not
work correctly in compat mode with libata.

I have investigated the issue further and found multiple problems
that all appeared with the same commit that originally introduced
HDIO_GET_32BIT handling in libata back in linux-2.6.8 and presumably
also linux-2.4, as the code uses "copy_to_user(arg, &val, 1)" to copy
a 'long' variable containing either 0 or 1 to user space.

The problems with this are:

* On big-endian machines, this will always write a zero because it
  stores the wrong byte into user space.

* In compat mode, the upper three bytes of the variable are updated
  by the compat_hdio_ioctl() function, but they now contain
  uninitialized stack data.

* The hdparm tool calling this ioctl uses a 'static long' variable
  to store the result. This means at least the upper bytes are
  initialized to zero, but calling another ioctl like HDIO_GET_MULTCOUNT
  would fill them with data that remains stale when the low byte
  is overwritten. Fortunately libata doesn't implement any of the
  affected ioctl commands, so this would only happen when we query
  both an IDE and an ATA device in the same command such as
  "hdparm -N -c /dev/hda /dev/sda"

* The libata code for unknown reasons started using ATA_IOC_GET_IO32
  and ATA_IOC_SET_IO32 as aliases for HDIO_GET_32BIT and HDIO_SET_32BIT,
  while the ioctl commands that were added later use the normal
  HDIO_* names. This is harmless but rather confusing.

This addresses all four issues by changing the code to use put_user()
on an 'unsigned long' variable in HDIO_GET_32BIT, like the IDE subsystem
does, and by clarifying the names of the ioctl commands.

Signed-off-by: Arnd Bergmann 
Reported-by: Soohoon Lee 
Tested-by: Soohoon Lee 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

target: Fix WRITE_SAME/DISCARD conversion to linux 512b sectors

2016-03-09T23:34:51+00:00

commit 8a9ebe717a133ba7bc90b06047f43cc6b8bcb8b3 upstream.

In a couple places we are not converting to/from the Linux
block layer 512 bytes sectors.

1.

The request queue values and what we do are a mismatch of
things:

max_discard_sectors - This is in linux block layer 512 byte
sectors. We are just copying this to max_unmap_lba_count.

discard_granularity - This is in bytes. We are converting it
to Linux block layer 512 byte sectors.

discard_alignment - This is in bytes. We are just copying
this over.

The problem is that the core LIO code exports these values in
spc_emulate_evpd_b0 and we use them to test request arguments
in sbc_execute_unmap, but we never convert to the block size
we export to the initiator. If we are not using 512 byte sectors
then we are exporting the wrong values or are checks are off.
And, for the discard_alignment/bytes case we are just plain messed
up.

2.

blkdev_issue_discard's start and number of sector arguments
are supposed to be in linux block layer 512 byte sectors. We are
currently passing in the values we get from the initiator which
might be based on some other sector size.

There is a similar problem in iblock_execute_write_same where
the bio functions want values in 512 byte sectors but we are
passing in what we got from the initiator.

Signed-off-by: Mike Christie 
Signed-off-by: Nicholas Bellinger 
[ kamal: backport to 4.4-stable: no unmap_zeroes_data ]
Signed-off-by: Kamal Mostafa 
Signed-off-by: Greg Kroah-Hartman

use ->d_seq to get coherency between ->d_inode and ->d_flags

2016-03-09T23:34:49+00:00

commit a528aca7f359f4b0b1d72ae406097e491a5ba9ea upstream.

Games with ordering and barriers are way too brittle.  Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.

Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman