<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux, branch v4.4.115</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>bpf: avoid false sharing of map refcount with max_entries</title>
<updated>2018-02-03T16:04:24+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2018-01-30T02:37:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=96d9b2338bed553c37f759127d8d18c857449ceb'/>
<id>96d9b2338bed553c37f759127d8d18c857449ceb</id>
<content type='text'>
[ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ]

In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds
speculation"), also change the layout of struct bpf_map such that
false sharing of fast-path members like max_entries is avoided
when the map's reference counter is altered. To that end, enforce
that these members are placed in separate cachelines.

pahole dump after change:

  struct bpf_map {
        const struct bpf_map_ops  * ops;                 /*     0     8 */
        struct bpf_map *           inner_map_meta;       /*     8     8 */
        void *                     security;             /*    16     8 */
        enum bpf_map_type          map_type;             /*    24     4 */
        u32                        key_size;             /*    28     4 */
        u32                        value_size;           /*    32     4 */
        u32                        max_entries;          /*    36     4 */
        u32                        map_flags;            /*    40     4 */
        u32                        pages;                /*    44     4 */
        u32                        id;                   /*    48     4 */
        int                        numa_node;            /*    52     4 */
        bool                       unpriv_array;         /*    56     1 */

        /* XXX 7 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct user_struct *       user;                 /*    64     8 */
        atomic_t                   refcnt;               /*    72     4 */
        atomic_t                   usercnt;              /*    76     4 */
        struct work_struct         work;                 /*    80    32 */
        char                       name[16];             /*   112    16 */
        /* --- cacheline 2 boundary (128 bytes) --- */

        /* size: 128, cachelines: 2, members: 17 */
        /* sum members: 121, holes: 1, sum holes: 7 */
  };

Now all entries in the first cacheline are read-only throughout
the lifetime of the map, set up once during map creation. Overall
struct size and number of cachelines don't change from the
reordering. struct bpf_map is usually the first member of, and
embedded in, the map structs of specific map implementations, so
also avoid letting the mutable members sit at the end, where they
could share a cacheline with the first map values, e.g. in the
array map: remote CPUs can trigger map updates for those values
just as well (easily and intentionally dirtying the shared line,
as with max_entries) while keeping subsequent values in cache.
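
A minimal sketch of the layout idiom, assuming the kernel's
____cacheline_aligned attribute and a trimmed, hypothetical field set
(example_map is illustrative, not the verbatim upstream diff):

  #include &lt;linux/atomic.h&gt;
  #include &lt;linux/cache.h&gt;
  #include &lt;linux/types.h&gt;

  struct user_struct;

  /* hypothetical example, fields trimmed */
  struct example_map {
          /* 1st cacheline: read-mostly fast-path members, set up once */
          const void *            ops ____cacheline_aligned;
          u32                     max_entries;
          u32                     map_flags;

          /* 2nd cacheline: mutable members, kept off the fast path so
           * refcount churn never dirties the line holding max_entries */
          struct user_struct *    user ____cacheline_aligned;
          atomic_t                refcnt;
          atomic_t                usercnt;
  };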

Quoting from Google's Project Zero blog [1]:

  Additionally, at least on the Intel machine on which this was
  tested, bouncing modified cache lines between cores is slow,
  apparently because the MESI protocol is used for cache coherence
  [8]. Changing the reference counter of an eBPF array on one
  physical CPU core causes the cache line containing the reference
  counter to be bounced over to that CPU core, making reads of the
  reference counter on all other CPU cores slow until the changed
  reference counter has been written back to memory. Because the
  length and the reference counter of an eBPF array are stored in
  the same cache line, this also means that changing the reference
  counter on one physical CPU core causes reads of the eBPF array's
  length to be slow on other physical CPU cores (intentional false
  sharing).

While this doesn't 'control' the out-of-bounds speculation through
masking the index as in commit b2157399cc98, triggering a manipulation
of the map's reference counter is trivial, so let's not allow it to
easily affect max_entries.

Splitting into separate cachelines also generally makes sense from
a performance perspective in that the fast path won't take a cache
miss when the map gets pinned, reused in other progs, etc. from the
control path, and thus unintentional false sharing is avoided as well.

  [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tcp: __tcp_hdrlen() helper</title>
<updated>2018-01-31T11:06:13+00:00</updated>
<author>
<name>Craig Gallek</name>
<email>kraig@google.com</email>
</author>
<published>2016-02-10T16:50:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=bed42ef5ebbfc046e0440f7e0f941cc07f8ca1e8'/>
<id>bed42ef5ebbfc046e0440f7e0f941cc07f8ca1e8</id>
<content type='text'>
commit d9b3fca27385eafe61c3ca6feab6cb1e7dc77482 upstream.

tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr.
This splits the size calculation into a helper function that can be
used if a struct tcphdr is already available.
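
Sketched from the upstream change to include/linux/tcp.h, the split
presumably looks along these lines (the TCP header length lives in the
4-bit doff field, counted in 32-bit words):

  static inline unsigned int __tcp_hdrlen(const struct tcphdr *th)
  {
          /* doff counts 32-bit words, so bytes = doff * 4 */
          return th-&gt;doff * 4;
  }

  static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
  {
          /* legacy entry point keeps working via the new helper */
          return __tcp_hdrlen(tcp_hdr(skb));
  }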

Signed-off-by: Craig Gallek &lt;kraig@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Revert "module: Add retpoline tag to VERMAGIC"</title>
<updated>2018-01-31T11:06:11+00:00</updated>
<author>
<name>Greg Kroah-Hartman</name>
<email>gregkh@linuxfoundation.org</email>
</author>
<published>2018-01-24T14:28:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5f1dc06152e1683a9277a2836517d6574757dd1f'/>
<id>5f1dc06152e1683a9277a2836517d6574757dd1f</id>
<content type='text'>
commit 5132ede0fe8092b043dae09a7cc32b8ae7272baa upstream.

This reverts commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12.

Turns out distros do not want to make retpoline part of their "ABI",
so this patch should not have been merged.  Sorry Andi, this was my
fault; I suggested it when your original patch was the "correct" way of
doing this instead.

Reported-by: Jiri Kosina &lt;jikos@kernel.org&gt;
Fixes: 6cfb521ac0d5 ("module: Add retpoline tag to VERMAGIC")
Acked-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: David Woodhouse &lt;dwmw@amazon.co.uk&gt;
Cc: rusty@rustcorp.com.au
Cc: arjan.van.de.ven@intel.com
Cc: jeyu@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>netfilter: fix IS_ERR_VALUE usage</title>
<updated>2018-01-31T11:06:10+00:00</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2016-04-29T08:39:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=936b21419e7c5be2f81e6dea02fc3d8852f3fb83'/>
<id>936b21419e7c5be2f81e6dea02fc3d8852f3fb83</id>
<content type='text'>
commit 92b4423e3a0bc5d43ecde4bcad871f8b5ba04efd upstream.

This is a forward-port of the original patch from Andrzej Hajda,
he said:

"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.

The patch follows conclusion from discussion on LKML [1][2].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"
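
The resulting call-site pattern is presumably along these lines (a
sketch; -ENOMEM and the entry variable e follow x_tables convention):

  unsigned long pcnt;

  /* returns an unsigned long so IS_ERR_VALUE() works as intended */
  pcnt = xt_percpu_counter_alloc();
  if (IS_ERR_VALUE(pcnt))
          return -ENOMEM;
  e-&gt;counters.pcnt = pcnt;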

Original patch from Andrzej is here:

http://patchwork.ozlabs.org/patch/582970/

This patch has clashed with input validation fixes for x_tables.

Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Acked-by: Michal Kubecek &lt;mkubecek@suse.cz&gt;

</content>
</entry>
<entry>
<title>netfilter: x_tables: speed up jump target validation</title>
<updated>2018-01-31T11:06:10+00:00</updated>
<author>
<name>Florian Westphal</name>
<email>fw@strlen.de</email>
</author>
<published>2016-07-14T15:51:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=45cf54e13c70ce0ec4875220103916978ce3ed07'/>
<id>45cf54e13c70ce0ec4875220103916978ce3ed07</id>
<content type='text'>
commit f4dc77713f8016d2e8a3295e1c9c53a21f296def upstream.

The dummy ruleset I used to test the original validation change was broken,
most rules were unreachable and were not tested by mark_source_chains().

In some cases rulesets that used to load in a few seconds now require
several minutes.

sample ruleset that shows the behaviour:

echo "*filter"
for i in $(seq 0 100000);do
        printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT

[ pipe result into iptables-restore ]

This ruleset will be about 74 MB in size, with ~500k searches
through all 500k[1] rule entries. iptables-restore will take forever
(I gave up after 10 minutes).

Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.
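
A sketch of the resulting lookup over the sorted offset array (upstream
names this helper xt_find_jump_offset; the exact signature here is
approximate):

  #include &lt;linux/types.h&gt;

  /* offsets[] holds the sorted start offset of every ipt_entry in
   * the blob; a jump is valid iff its target appears in the array */
  bool xt_find_jump_offset(const unsigned int *offsets,
                           unsigned int target, unsigned int size)
  {
          unsigned int low = 0, hi = size;

          while (hi &gt; low) {
                  unsigned int m = low + (hi - low) / 2;

                  if (offsets[m] &gt; target)
                          hi = m;
                  else if (offsets[m] &lt; target)
                          low = m + 1;
                  else
                          return true;
          }
          return false;
  }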

After this change, ruleset restore times again get close to what one
gets when reverting 36472341017529e (~3 seconds on my workstation).

[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -&gt; 500k rule entries

Fixes: 36472341017529e ("netfilter: x_tables: validate targets of jumps")
Reported-by: Jeff Wu &lt;wujiafu@gmail.com&gt;
Tested-by: Jeff Wu &lt;wujiafu@gmail.com&gt;
Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Acked-by: Michal Kubecek &lt;mkubecek@suse.cz&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled</title>
<updated>2018-01-31T11:06:08+00:00</updated>
<author>
<name>Sudeep Holla</name>
<email>sudeep.holla@arm.com</email>
</author>
<published>2016-10-28T08:45:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f31a8450c09327339939fe195707749c1d67016d'/>
<id>f31a8450c09327339939fe195707749c1d67016d</id>
<content type='text'>
commit fac51482577d5e05bbb0efa8d602a3c2111098bf upstream.

With CONFIG_OF enabled on x86, we get the following error on boot:
"
	Failed to find cpu0 device node
	Unable to detect cache hierarchy from DT for CPU 0
"
and the cacheinfo fails to get populated in the corresponding sysfs
entries. This is because cache_setup_of_node looks for an of_node to
set up the shared cpu_map, without checking that the map was already
populated in the architecture-specific callback.

In order to indicate that the shared cpu_map is already populated, this
patch introduces a boolean `cpu_map_populated` in struct cpu_cacheinfo
that can be used by the generic code to skip cache_shared_cpu_map_setup.

This patch also sets that boolean for x86.
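
Sketched, the generic-side check presumably reduces to an early exit in
drivers/base/cacheinfo.c (cpu_map_populated as introduced here; the
rest of the body is abridged to the DT path):

  static int cache_shared_cpu_map_setup(unsigned int cpu)
  {
          struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);

          /* arch code (x86 here) already filled the shared cpu_map,
           * so skip the DT-based setup entirely */
          if (this_cpu_ci-&gt;cpu_map_populated)
                  return 0;

          return cache_setup_of_node(cpu);
  }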

Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sudeep Holla &lt;sudeep.holla@arm.com&gt;
Signed-off-by: Mian Yousaf Kaukab &lt;yousaf.kaukab@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>time: Avoid undefined behaviour in ktime_add_safe()</title>
<updated>2018-01-31T11:06:08+00:00</updated>
<author>
<name>Vegard Nossum</name>
<email>vegard.nossum@oracle.com</email>
</author>
<published>2016-08-12T23:37:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=08b1cf4964d526cf3c2ea9e36ffd752752543ef3'/>
<id>08b1cf4964d526cf3c2ea9e36ffd752752543ef3</id>
<content type='text'>
commit 979515c5645830465739254abc1b1648ada41518 upstream.

I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16
    signed integer overflow:
    9223372036854775807 + 50000 cannot be represented in type 'long long int'
    CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
     0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3
     ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60
     000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320
    Call Trace:
     [&lt;ffffffff82344740&gt;] dump_stack+0xac/0xfc
     [&lt;ffffffff82344694&gt;] ? _atomic_dec_and_lock+0xc4/0xc4
     [&lt;ffffffff8242df78&gt;] ubsan_epilogue+0xd/0x8a
     [&lt;ffffffff8242e6b4&gt;] handle_overflow+0x202/0x23d
     [&lt;ffffffff8242e4b2&gt;] ? val_to_string.constprop.6+0x11e/0x11e
     [&lt;ffffffff8236df71&gt;] ? timerqueue_add+0x151/0x410
     [&lt;ffffffff81485c48&gt;] ? hrtimer_start_range_ns+0x3b8/0x1380
     [&lt;ffffffff81795631&gt;] ? memset+0x31/0x40
     [&lt;ffffffff8242e6fd&gt;] __ubsan_handle_add_overflow+0xe/0x10
     [&lt;ffffffff81488ac9&gt;] hrtimer_nanosleep+0x5d9/0x790
     [&lt;ffffffff814884f0&gt;] ? hrtimer_init_sleeper+0x80/0x80
     [&lt;ffffffff813a9ffb&gt;] ? __might_sleep+0x5b/0x260
     [&lt;ffffffff8148be10&gt;] common_nsleep+0x20/0x30
     [&lt;ffffffff814906c7&gt;] SyS_clock_nanosleep+0x197/0x210
     [&lt;ffffffff81490530&gt;] ? SyS_clock_getres+0x150/0x150
     [&lt;ffffffff823c7113&gt;] ? __this_cpu_preempt_check+0x13/0x20
     [&lt;ffffffff8162ef60&gt;] ? __context_tracking_exit.part.3+0x30/0x1b0
     [&lt;ffffffff81490530&gt;] ? SyS_clock_getres+0x150/0x150
     [&lt;ffffffff81007bd3&gt;] do_syscall_64+0x1b3/0x4b0
     [&lt;ffffffff845f85aa&gt;] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

Add a new ktime_add_unsafe() helper which doesn't check for overflow, but
doesn't throw a UBSAN warning when it does overflow either.
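
The helper presumably performs the addition in unsigned arithmetic,
where wraparound is well defined, along these lines (ktime_t of this
era carries a tv64 member):

  static inline ktime_t ktime_add_unsafe(const ktime_t lhs, const ktime_t rhs)
  {
          /* u64 addition wraps instead of overflowing, so UBSAN is quiet */
          return (ktime_t){ .tv64 = (u64) lhs.tv64 + rhs.tv64 };
  }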

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Richard Cochran &lt;richardcochran@gmail.com&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Signed-off-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks</title>
<updated>2018-01-31T11:06:07+00:00</updated>
<author>
<name>Daniel Bristot de Oliveira</name>
<email>bristot@redhat.com</email>
</author>
<published>2017-05-29T14:24:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1d00e3d9b7ed8fbf6729b7c5a8aae9f2711d2655'/>
<id>1d00e3d9b7ed8fbf6729b7c5a8aae9f2711d2655</id>
<content type='text'>
commit 3effcb4247e74a51f5d8b775a1ee4abf87cc089a upstream.

We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for this sort of task.

One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such a situation,
the original CBS assumes that the task is facing an early activation,
so it replenishes the task and sets another deadline, one relative
deadline in the future. This rule works fine for implicit deadline
tasks. Moreover, it allows the system to adapt the period of a task
whose external event source suffered from clock drift.

However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:

  runtime   = 5 ms
  deadline  = 7 ms
  [density] = 5 / 7 = 0.71
  period    = 1000 ms

If the task runs for 1 ms and then suspends for another 1 ms,
it will be awakened with the following parameters:

  remaining runtime = 4 ms
  laxity = 5 ms

presenting an absolute density of 4 / 5 = 0.80.

In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.

The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1 ms every 2 ms. Following this
behavior, the task would be able to run 500 ms in 1 sec, thus running
more than the 5 ms / 1 sec that admission control allowed it.

Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.

A revised version of the idea is:

At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:

  runtime / (deadline - t) &lt;= dl_runtime / dl_deadline

Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the inequality, thus defining the maximum
remaining runtime a task can use within the absolute deadline without
overrunning the allowed density:

  runtime = (dl_runtime / dl_deadline) * (deadline - t)

For instance, in our previous example, the task could still run:

  runtime = ( 5 / 7 ) * 5
  runtime = 3.57 ms

without causing damage to other deadline tasks. It is noteworthy
that the laxity cannot be negative, because that would cause a negative
runtime. Thus, this patch depends on the patch:

  df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline")

which throttles a constrained deadline task activated after the
deadline.

Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.

[The main difference from the original commit is that
 the BW_SHIFT define was not present yet. As BW_SHIFT was
 introduced by a later feature, I just used the value (20),
 as we used to do before the #define existed.
 Other changes were required because of comments. - bristot]
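
For reference, a sketch of the resulting wakeup-time computation (the
wrapper name here is hypothetical; the dl_se fields and the
div64_u64/rq_clock helpers are the usual kernel ones; 20 is the shift
noted above):

  static void revise_runtime_at_wakeup(struct sched_dl_entity *dl_se,
                                       struct rq *rq)
  {
          /* runtime = (dl_runtime / dl_deadline) * laxity, in 20-bit
           * fixed point to keep precision through the u64 division */
          u64 laxity = dl_se-&gt;deadline - rq_clock(rq);

          dl_se-&gt;runtime = (div64_u64(dl_se-&gt;dl_runtime &lt;&lt; 20,
                                      dl_se-&gt;dl_deadline) * laxity) &gt;&gt; 20;
  }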

Reported-by: Xunlei Pang &lt;xpang@redhat.com&gt;
Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Juri Lelli &lt;juri.lelli@arm.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Luca Abeni &lt;luca.abeni@santannapisa.it&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Romulo Silva de Oliveira &lt;romulo.deoliveira@ufsc.br&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Tommaso Cucinotta &lt;tommaso.cucinotta@sssup.it&gt;
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>module: Add retpoline tag to VERMAGIC</title>
<updated>2018-01-23T18:50:15+00:00</updated>
<author>
<name>Andi Kleen</name>
<email>ak@linux.intel.com</email>
</author>
<published>2018-01-16T20:52:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1782af2835fe6fbff6265a53fb85847fb9c58137'/>
<id>1782af2835fe6fbff6265a53fb85847fb9c58137</id>
<content type='text'>
commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12 upstream.

Add a marker for retpoline to the module VERMAGIC. This catches the case
when a module compiled without RETPOLINE gets loaded into a retpoline
kernel, making it insecure.

It doesn't handle the case when retpoline has been disabled at runtime.
Even in this case the retpoline compile status will still be matched.
This implies that even with retpoline disabled at runtime, all loaded
modules need to be recompiled.
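
The mechanism presumably amounts to a conditional tag concatenated into
the vermagic string, roughly as below (a sketch of
include/linux/vermagic.h; the macro name is as best recalled from the
upstream diff):

  /* kernel built with retpolines: tag modules accordingly, so a
   * mismatched module fails the vermagic check at load time */
  #ifdef RETPOLINE
  #define MODULE_VERMAGIC_RETPOLINE "retpoline "
  #else
  #define MODULE_VERMAGIC_RETPOLINE ""
  #endif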

Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Acked-by: David Woodhouse &lt;dwmw@amazon.co.uk&gt;
Cc: rusty@rustcorp.com.au
Cc: arjan.van.de.ven@intel.com
Cc: jeyu@kernel.org
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20180116205228.4890-1-andi@firstfloor.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>kconfig.h: use __is_defined() to check if MODULE is defined</title>
<updated>2018-01-23T18:50:12+00:00</updated>
<author>
<name>Masahiro Yamada</name>
<email>yamada.masahiro@socionext.com</email>
</author>
<published>2016-06-14T05:58:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=675901851fd2a00c7a70bcf327303efba3b81e2f'/>
<id>675901851fd2a00c7a70bcf327303efba3b81e2f</id>
<content type='text'>
commit 4f920843d248946545415c1bf6120942048708ed upstream.

The macro MODULE is not a config option; it is a per-file build
option.  So, config_enabled(MODULE) is not sensible.  (There is
another case in include/linux/export.h, where config_enabled() is
used against a non-config option.)

This commit renames some macros in include/linux/kconfig.h for use
with non-config macros and replaces config_enabled(MODULE) with
__is_defined(MODULE).

I am keeping config_enabled() because it is still referenced from
some places, but I expect it to be deprecated in the future.
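
The underlying preprocessor trick, sketched (helper names may differ
slightly from the tree): a macro defined to 1 drags in a placeholder
that shifts the argument list, so the selector picks up 1; any other
expansion yields 0.

  #define __ARG_PLACEHOLDER_1 0,
  #define __take_second_arg(__ignored, val, ...) val

  /* __is_defined(FOO) expands to 1 if FOO is #defined to 1, else 0 */
  #define __is_defined(x)                 ___is_defined(x)
  #define ___is_defined(val)              ____is_defined(__ARG_PLACEHOLDER_##val)
  #define ____is_defined(arg1_or_junk)    __take_second_arg(arg1_or_junk 1, 0)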

Signed-off-by: Masahiro Yamada &lt;yamada.masahiro@socionext.com&gt;
Signed-off-by: Michal Marek &lt;mmarek@suse.com&gt;
Signed-off-by: Razvan Ghitulete &lt;rga@amazon.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
