linux-toradex.git/net/sched, branch v4.1.10

net: revert "net_sched: move tp->root allocation into fw_init()"

2015-10-03T11:49:17+00:00

[ Upstream commit d8aecb10115497f6cdf841df8c88ebb3ba25fa28 ]

fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
Acked-by: Jamal Hadi Salim 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

cls_u32: complete the check for non-forced case in u32_destroy()

2015-10-03T11:49:12+00:00

[ Upstream commit a6c1aea044e490da3e59124ec55991fe316818d5 ]

In commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
I added a check in u32_destroy() to see if all real filters are gone
for each tp, however, that is only done for root_ht, same is needed
for others.

This can be reproduced by the following tc commands:

tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
ht 15:2: match ip src 10.0.0.2 flowid 1:10
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
ht 15:2: match ip src 10.0.0.3 flowid 1:10

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Akshat Kakkar 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
Signed-off-by: Cong Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: sched: fix refcount imbalance in actions

2015-09-29T17:26:24+00:00

[ Upstream commit 28e6b67f0b292f557468c139085303b15f1a678f ]

Since commit 55334a5db5cd ("net_sched: act: refuse to remove bound action
outside"), we end up with a wrong reference count for a tc action.

Test case 1:

  FOO="1,6 0 0 4294967295,"
  BAR="1,6 0 0 4294967294,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
     action bpf bytecode "$FOO"
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 1 bind 1
  tc actions replace action bpf bytecode "$BAR" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
    index 1 ref 2 bind 1
  tc actions replace action bpf bytecode "$FOO" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 3 bind 1

Test case 2:

  FOO="1,6 0 0 4294967295,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
  tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
     index 1 ref 1 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 2 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 3 bind 1

What happens is that in tcf_hash_check(), we check tcf_common for a given
index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
found an existing action. Now there are the following cases:

  1) We do a late binding of an action. In that case, we leave the
     tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
     handler. This is correctly handeled.

  2) We replace the given action, or we try to add one without replacing
     and find out that the action at a specific index already exists
     (thus, we go out with error in that case).

In case of 2), we have to undo the reference count increase from
tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
do so because of the 'tcfc_bindcnt > 0' check which bails out early with
an -EPERM error.

Now, while commit 55334a5db5cd prevents 'tc actions del action ...' on an
already classifier-bound action to drop the reference count (which could
then become negative, wrap around etc), this restriction only accounts for
invocations outside a specific action's ->init() handler.

One possible solution would be to add a flag thus we possibly trigger
the -EPERM ony in situations where it is indeed relevant.

After the patch, above test cases have correct reference count again.

Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside")
Signed-off-by: Daniel Borkmann 
Reviewed-by: Cong Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

act_bpf: fix memory leaks when replacing bpf programs

2015-09-29T17:26:24+00:00

[ Upstream commit f4eaed28c7834fc049c754f63e6988bbd73778d9 ]

We currently trigger multiple memory leaks when replacing bpf
actions, besides others:

  comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
  hex dump (first 32 bytes):
    01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
    18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
  backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] __vmalloc_node_range+0x1bd/0x2c0
    [] __vmalloc+0x4a/0x50
    [] bpf_prog_alloc+0x3a/0xa0
    [] bpf_prog_create+0x44/0xa0
    [] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
    [] tcf_action_init_1+0x191/0x1b0
    [] tcf_action_init+0x82/0xf0
    [] tcf_exts_validate+0xb2/0xc0
    [] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
    [] cls_bpf_change+0x1a6/0x274 [cls_bpf]
    [] tc_ctl_tfilter+0x335/0x910
    [] rtnetlink_rcv_msg+0x95/0x240
    [] netlink_rcv_skb+0xaf/0xc0
    [] rtnetlink_rcv+0x2e/0x40
    [] netlink_unicast+0xef/0x1b0

Issue is that the old content from tcf_bpf is allocated and needs
to be released when we replace it. We seem to do that since the
beginning of act_bpf on the filter and insns, later on the name as
well.

Example test case, after patch:

  # FOO="1,6 0 0 4294967295,"
  # BAR="1,6 0 0 4294967294,"
  # tc actions add action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$BAR" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions del action bpf index 2
  [...]
  # echo "scan" > /sys/kernel/debug/kmemleak
  # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
  0

Fixes: d23b8ad8ab23 ("tc: add BPF based action")
Signed-off-by: Daniel Borkmann 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

sched: cls_flow: fix panic on filter replace

2015-09-29T17:26:23+00:00

[ Upstream commit 32b2f4b196b37695fdb42b31afcbc15399d6ef91 ]

The following test case causes a NULL pointer dereference in cls_flow:

  tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok
  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
            flow hash keys mark action drop

To be more precise, actually two different panics are fixed, the first
occurs because tcf_exts_init() is not called on the newly allocated
filter when we do a replace. And the second panic uncovered after that
happens since the arguments of list_replace_rcu() are swapped, the old
element needs to be the first argument and the new element the second.

Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
Signed-off-by: Daniel Borkmann 
Acked-by: John Fastabend 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

sched: cls_bpf: fix panic on filter replace

2015-09-29T17:26:23+00:00

[ Upstream commit f6bfc46da6292b630ba389592123f0dd02066172 ]

The following test case causes a NULL pointer dereference in cls_bpf:

  FOO="1,6 0 0 4294967295,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
            bpf bytecode "$FOO" flowid 1:1 action drop

The problem is that commit 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
accidentally swapped the arguments of list_replace_rcu(), the old
element needs to be the first argument and the new element the second.

Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
Signed-off-by: Daniel Borkmann 
Acked-by: John Fastabend 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

fq_codel: fix a use-after-free

2015-09-29T17:26:22+00:00

[ Upstream commit 052cbda41fdc243a8d40cce7ab3a6327b4b2887e ]

Fixes: 25331d6ce42b ("net: sched: implement qstat helper routines")
Cc: John Fastabend 
Signed-off-by: Cong Wang 
Signed-off-by: Cong Wang 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net_sched: invoke ->attach() after setting dev->qdisc

2015-05-27T18:09:55+00:00

For mq qdisc, we add per tx queue qdisc to root qdisc
for display purpose, however, that happens too early,
before the new dev->qdisc is finally set, this causes
q->list points to an old root qdisc which is going to be
freed right before assigning with a new one.

Fix this by moving ->attach() after setting dev->qdisc.

For the record, this fixes the following crash:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
 list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
 CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
  ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
  ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
 Call Trace:
  [] dump_stack+0x4c/0x65
  [] warn_slowpath_common+0x9c/0xb6
  [] ? __list_del_entry+0x5a/0x98
  [] warn_slowpath_fmt+0x46/0x48
  [] ? dev_graft_qdisc+0x5e/0x6a
  [] __list_del_entry+0x5a/0x98
  [] list_del+0xe/0x2d
  [] qdisc_list_del+0x1e/0x20
  [] qdisc_destroy+0x30/0xd6
  [] qdisc_graft+0x11d/0x243
  [] tc_get_qdisc+0x1a6/0x1d4
  [] ? mark_lock+0x2e/0x226
  [] rtnetlink_rcv_msg+0x181/0x194
  [] ? rtnl_lock+0x17/0x19
  [] ? rtnl_lock+0x17/0x19
  [] ? __rtnl_unlock+0x17/0x17
  [] netlink_rcv_skb+0x4d/0x93
  [] rtnetlink_rcv+0x26/0x2d
  [] netlink_unicast+0xcb/0x150
  [] ? might_fault+0x59/0xa9
  [] netlink_sendmsg+0x4fa/0x51c
  [] sock_sendmsg_nosec+0x12/0x1d
  [] sock_sendmsg+0x29/0x2e
  [] ___sys_sendmsg+0x1b4/0x23a
  [] ? native_sched_clock+0x35/0x37
  [] ? sched_clock_local+0x12/0x72
  [] ? sched_clock_cpu+0x9e/0xb7
  [] ? current_kernel_time+0xe/0x32
  [] ? lock_release_holdtime.part.29+0x71/0x7f
  [] ? read_seqcount_begin.constprop.27+0x5f/0x76
  [] ? trace_hardirqs_on_caller+0x17d/0x199
  [] ? __fget_light+0x50/0x78
  [] __sys_sendmsg+0x42/0x60
  [] SyS_sendmsg+0x12/0x1c
  [] system_call_fastpath+0x12/0x6f
 ---[ end trace ef29d3fb28e97ae7 ]---

For long term, we probably need to clean up the qdisc_graft() code
in case it hides other bugs like this.

Fixes: 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs")
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: sched: fix call_rcu() race on classifier module unloads

2015-05-21T22:48:18+00:00

Vijay reported that a loop as simple as ...

  while true; do
    tc qdisc add dev foo root handle 1: prio
    tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
    tc qdisc del dev foo root
    rmmod cls_u32
  done

... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
locking to netlink_lookup() and seq walker").

The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3eb5 ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):

When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.

After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.

Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.

Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.

By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.

One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.

synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.

Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.

Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.

Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian 
Signed-off-by: Daniel Borkmann 
Cc: John Fastabend 
Cc: Eric Dumazet 
Cc: Thomas Graf 
Cc: Jamal Hadi Salim 
Cc: Alexei Starovoitov 
Tested-by: Vijay Subramanian 
Acked-by: Alexei Starovoitov 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

net_sched: gred: use correct backlog value in WRED mode

2015-05-11T17:26:26+00:00

In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.

Signed-off-by: David Ward 
Signed-off-by: David S. Miller