summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2019-08-09tcp: Update TCP_BASE_MSS commentJosh Hunt
TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this. Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09tcp: add new tcp_mtu_probe_floor sysctlJosh Hunt
The current implementation of TCP MTU probing can considerably underestimate the MTU on lossy connections allowing the MSS to get down to 48. We have found that in almost all of these cases on our networks these paths can handle much larger MTUs meaning the connections are being artificially limited. Even though TCP MTU probing can raise the MSS back up we have seen this not to be the case causing connections to be "stuck" with an MSS of 48 when heavy loss is present. Prior to pushing out this change we could not keep TCP MTU probing enabled b/c of the above reasons. Now with a reasonble floor set we've had it enabled for the past 6 months. The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives administrators the ability to control the floor of MSS probing. Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09devlink: remove pointless data_len arg from region snapshot createJiri Pirko
The size of the snapshot has to be the same as the size of the region, therefore no need to pass it again during snapshot creation. Remove the arg and use region->size instead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09netfilter: nf_tables: use-after-free in failing rule with bound setPablo Neira Ayuso
If a rule that has already a bound anonymous set fails to be added, the preparation phase releases the rule and the bound set. However, the transaction object from the abort path still has a reference to the set object that is stale, leading to a use-after-free when checking for the set->bound field. Add a new field to the transaction that specifies if the set is bound, so the abort path can skip releasing it since the rule command owns it and it takes care of releasing it. After this update, the set->bound field is removed. [ 24.649883] Unable to handle kernel paging request at virtual address 0000000000040434 [ 24.657858] Mem abort info: [ 24.660686] ESR = 0x96000004 [ 24.663769] Exception class = DABT (current EL), IL = 32 bits [ 24.669725] SET = 0, FnV = 0 [ 24.672804] EA = 0, S1PTW = 0 [ 24.675975] Data abort info: [ 24.678880] ISV = 0, ISS = 0x00000004 [ 24.682743] CM = 0, WnR = 0 [ 24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000 [ 24.692207] [0000000000040434] pgd=0000000000000000 [ 24.697119] Internal error: Oops: 96000004 [#1] SMP [...] [ 24.889414] Call trace: [ 24.891870] __nf_tables_abort+0x3f0/0x7a0 [ 24.895984] nf_tables_abort+0x20/0x40 [ 24.899750] nfnetlink_rcv_batch+0x17c/0x588 [ 24.904037] nfnetlink_rcv+0x13c/0x190 [ 24.907803] netlink_unicast+0x18c/0x208 [ 24.911742] netlink_sendmsg+0x1b0/0x350 [ 24.915682] sock_sendmsg+0x4c/0x68 [ 24.919185] ___sys_sendmsg+0x288/0x2c8 [ 24.923037] __sys_sendmsg+0x7c/0xd0 [ 24.926628] __arm64_sys_sendmsg+0x2c/0x38 [ 24.930744] el0_svc_common.constprop.0+0x94/0x158 [ 24.935556] el0_svc_handler+0x34/0x90 [ 24.939322] el0_svc+0x8/0xc [ 24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863) [ 24.948336] ---[ end trace cebbb9dcbed3b56f ]--- Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-08net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski
sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08netfilter: nf_tables_offload: support indr block callwenxu
nftable support indr-block call. It makes nftable an offload vlan and tunnel device. nft add table netdev firewall nft add chain netdev firewall aclout { type filter hook ingress offload device mlx_pf0vf0 priority - 300 \; } nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0 nft add chain netdev firewall aclin { type filter hook ingress device vlan0 priority - 300 \; } nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0 Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08flow_offload: support get multi-subsystem blockwenxu
It provide a callback list to find the blocks of tc and nft subsystems Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08flow_offload: move tc indirect block to flow offloadwenxu
move tc indirect block to flow_offload and rename it to flow indirect block.The nf_tables can use the indr block architecture. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08inet: frags: re-introduce skb coalescing for local deliveryGuillaume Nault
Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus generating many fragments) and running over an IPsec tunnel, reported more than 6Gbps throughput. After that patch, the same test gets only 9Mbps when receiving on a be2net nic (driver can make a big difference here, for example, ixgbe doesn't seem to be affected). By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert "ipv4: use skb coalescing in defragmentation"")). Without fragment coalescing, be2net runs out of Rx ring entries and starts to drop frames (ethtool reports rx_drops_no_frags errors). Since the netperf traffic is only composed of UDP fragments, any lost packet prevents reassembly of the full datagram. Therefore, fragments which have no possibility to ever get reassembled pile up in the reassembly queue, until the memory accounting exeeds the threshold. At that point no fragment is accepted anymore, which effectively discards all netperf traffic. When reassembly timeout expires, some stale fragments are removed from the reassembly queue, so a few packets can be received, reassembled and delivered to the netperf receiver. But the nic still drops frames and soon the reassembly queue gets filled again with stale fragments. These long time frames where no datagram can be received explain why the performance drop is so significant. Re-introducing fragment coalescing is enough to get the initial performances again (6.6Gbps with be2net): driver doesn't drop frames anymore (no more rx_drops_no_frags errors) and the reassembly engine works at full speed. This patch is quite conservative and only coalesces skbs for local IPv4 and IPv6 delivery (in order to avoid changing skb geometry when forwarding). Coalescing could be extended in the future if need be, as more scenarios would probably benefit from it. [0]: Test configuration Sender: ip xfrm policy flush ip xfrm state flush ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1 ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1 ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow netserver -D -L fc00:2::1 Receiver: ip xfrm policy flush ip xfrm state flush ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1 ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1 ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6 Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Just minor overlapping changes in the conflicts here. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: add ingress mirred action to hardware IRJohn Hurley
TC mirred actions (redirect and mirred) can send to egress or ingress of a device. Currently only egress is used for hw offload rules. Modify the intermediate representation for hw offload to include mirred actions that go to ingress. This gives drivers access to such rules and can decide whether or not to offload them. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: tc_act: add helpers to detect ingress mirred actionsJohn Hurley
TC mirred actions can send to egress or ingress on a given netdev. Helpers exist to detect actions that are mirred to egress. Extend the header file to include helpers to detect ingress mirred actions. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: add skbedit of ptype action to hardware IRJohn Hurley
TC rules can impliment skbedit actions. Currently actions that modify the skb mark are passed to offloading drivers via the hardware intermediate representation in the flow_offload API. Extend this to include skbedit actions that modify the packet type of the skb. Such actions may be used to set the ptype to HOST when redirecting a packet to ingress. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: tc_act: add skbedit_ptype helper functionsJohn Hurley
The tc_act header file contains an inline function that checks if an action is changing the skb mark of a packet and a further function to extract the mark. Add similar functions to check for and get skbedit actions that modify the packet type of the skb. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: sample: allow accessing psample_group with rtnlVlad Buslov
Recently implemented support for sample action in flow_offload infra leads to following rcu usage warning: [ 1938.234856] ============================= [ 1938.234858] WARNING: suspicious RCU usage [ 1938.234863] 5.3.0-rc1+ #574 Not tainted [ 1938.234866] ----------------------------- [ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage! [ 1938.234872] other info that might help us debug this: [ 1938.234875] rcu_scheduler_active = 2, debug_locks = 1 [ 1938.234879] 1 lock held by tc/19540: [ 1938.234881] #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970 [ 1938.234900] stack backtrace: [ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574 [ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1938.234911] Call Trace: [ 1938.234922] dump_stack+0x85/0xc0 [ 1938.234930] tc_setup_flow_action+0xed5/0x2040 [ 1938.234944] fl_hw_replace_filter+0x11f/0x2e0 [cls_flower] [ 1938.234965] fl_change+0xd24/0x1b30 [cls_flower] [ 1938.234990] tc_new_tfilter+0x3e0/0x970 [ 1938.235021] ? tc_del_tfilter+0x720/0x720 [ 1938.235028] rtnetlink_rcv_msg+0x389/0x4b0 [ 1938.235038] ? netlink_deliver_tap+0x95/0x400 [ 1938.235044] ? rtnl_dellink+0x2d0/0x2d0 [ 1938.235053] netlink_rcv_skb+0x49/0x110 [ 1938.235063] netlink_unicast+0x171/0x200 [ 1938.235073] netlink_sendmsg+0x224/0x3f0 [ 1938.235091] sock_sendmsg+0x5e/0x60 [ 1938.235097] ___sys_sendmsg+0x2ae/0x330 [ 1938.235111] ? __handle_mm_fault+0x12cd/0x19e0 [ 1938.235125] ? __handle_mm_fault+0x12cd/0x19e0 [ 1938.235138] ? find_held_lock+0x2b/0x80 [ 1938.235147] ? do_user_addr_fault+0x22d/0x490 [ 1938.235160] __sys_sendmsg+0x59/0xa0 [ 1938.235178] do_syscall_64+0x5c/0xb0 [ 1938.235187] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1938.235192] RIP: 0033:0x7ff9a4d597b8 [ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54 [ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8 [ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003 [ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001 [ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001 Change tcf_sample_psample_group() helper to allow using it from both rtnl and rcu protected contexts. Fixes: a7a7be6087b0 ("net/sched: add sample action to the hardware intermediate representation") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: police: allow accessing police->params with rtnlVlad Buslov
Recently implemented support for police action in flow_offload infra leads to following rcu usage warning: [ 1925.881092] ============================= [ 1925.881094] WARNING: suspicious RCU usage [ 1925.881098] 5.3.0-rc1+ #574 Not tainted [ 1925.881100] ----------------------------- [ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage! [ 1925.881106] other info that might help us debug this: [ 1925.881109] rcu_scheduler_active = 2, debug_locks = 1 [ 1925.881112] 1 lock held by tc/18591: [ 1925.881115] #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970 [ 1925.881124] stack backtrace: [ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574 [ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1925.881132] Call Trace: [ 1925.881138] dump_stack+0x85/0xc0 [ 1925.881145] tc_setup_flow_action+0x1771/0x2040 [ 1925.881155] fl_hw_replace_filter+0x11f/0x2e0 [cls_flower] [ 1925.881175] fl_change+0xd24/0x1b30 [cls_flower] [ 1925.881200] tc_new_tfilter+0x3e0/0x970 [ 1925.881231] ? tc_del_tfilter+0x720/0x720 [ 1925.881243] rtnetlink_rcv_msg+0x389/0x4b0 [ 1925.881250] ? netlink_deliver_tap+0x95/0x400 [ 1925.881257] ? rtnl_dellink+0x2d0/0x2d0 [ 1925.881264] netlink_rcv_skb+0x49/0x110 [ 1925.881275] netlink_unicast+0x171/0x200 [ 1925.881284] netlink_sendmsg+0x224/0x3f0 [ 1925.881299] sock_sendmsg+0x5e/0x60 [ 1925.881305] ___sys_sendmsg+0x2ae/0x330 [ 1925.881309] ? task_work_add+0x43/0x50 [ 1925.881314] ? fput_many+0x45/0x80 [ 1925.881329] ? __lock_acquire+0x248/0x1930 [ 1925.881342] ? find_held_lock+0x2b/0x80 [ 1925.881347] ? task_work_run+0x7b/0xd0 [ 1925.881359] __sys_sendmsg+0x59/0xa0 [ 1925.881375] do_syscall_64+0x5c/0xb0 [ 1925.881381] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1925.881384] RIP: 0033:0x7feb245047b8 [ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54 [ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8 [ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003 [ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001 [ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001 Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to allow using them from both rtnl and rcu protected contexts. Fixes: 8c8cfc6ed274 ("net/sched: add police action to the hardware intermediate representation") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-05net/tls: partially revert fix transition through disconnect with closeJakub Kicinski
Looks like we were slightly overzealous with the shutdown() cleanup. Even though the sock->sk_state can reach CLOSED again, socket->state will not got back to SS_UNCONNECTED once connections is ESTABLISHED. Meaning we will see EISCONN if we try to reconnect, and EINVAL if we try to listen. Only listen sockets can be shutdown() and reused, but since ESTABLISHED sockets can never be re-connected() or used for listen() we don't need to try to clean up the ULP state early. Fixes: 32857cf57f92 ("net/tls: fix transition through disconnect with close") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-03netfilter: synproxy: rename mss synproxy_options fieldFernando Fernandez Mancera
After introduce "mss_encode" field in the synproxy_options struct the field "mss" is a little confusing. It has been renamed to "mss_option". Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-31Merge tag 'mac80211-next-for-davem-2019-07-31' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== We have a reasonably large number of changes: * lots more HE (802.11ax) support, particularly things relevant for the the AP side, but also mesh support * debugfs cleanups from Greg * some more work on extended key ID * start using genl parallel_ops, as preparation for weaning ourselves off RTNL and getting parallelism * various other changes all over ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31Merge tag 'mac80211-for-davem-2019-07-31' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just a few fixes: * revert NETIF_F_LLTX usage as it caused problems * avoid warning on WMM parameters from AP that are too short * fix possible null-ptr dereference in hwsim * fix interface combinations with 4-addr and crypto control ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31mac80211: allow setting spatial reuse parameters from bss_confJohn Crispin
Store the OBSS PD parameters inside bss_conf when bringing up an AP and/or when a station connects to an AP. This allows the driver to configure the HW accordingly. Signed-off-by: John Crispin <john@phrozen.org> Link: https://lore.kernel.org/r/20190730163701.18836-3-john@phrozen.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-31cfg80211: add support for parsing OBBS_PD attributesJohn Crispin
Add the data structure, policy and parsing code allowing userland to send the OBSS PD information into the kernel. Signed-off-by: John Crispin <john@phrozen.org> Link: https://lore.kernel.org/r/20190730163701.18836-2-john@phrozen.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-30tcp: add skb-less helpers to retrieve SYN cookiePetar Penkov
This patch allows generation of a SYN cookie before an SKB has been allocated, as is the case at XDP. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30net: dsa: ksz: Add KSZ8795 tag codeTristram Ha
Add DSA tag code for Microchip KSZ8795 switch. The switch is simpler and the tag is only 1 byte, instead of 2 as is the case with KSZ9477. Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com> Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-29mac80211: propagate HE operation info into bss_confJohn Crispin
Upon a successful assoc a station shall store the content of the HE operation element inside bss_conf so that the driver can setup the hardware accordingly. Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com> Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20190729102342.8659-2-john@phrozen.org [use struct copy] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: add IEEE80211_KEY_FLAG_GENERATE_MMIE to ieee80211_key_flagsLorenzo Bianconi
Add IEEE80211_KEY_FLAG_GENERATE_MMIE flag to ieee80211_key_flags in order to allow the driver to notify mac80211 to generate MMIE and that it requires sequence number generation only. This is a preliminary patch to add BIP_CMAC_128 hw support to mt7615 driver Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/dfe275f9aa0f1cc6b33085f9efd5d8447f68ad13.1563228405.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26crypto: user - make NETLINK_CRYPTO work inside netnsOndrej Mosnacek
Currently, NETLINK_CRYPTO works only in the init network namespace. It doesn't make much sense to cut it out of the other network namespaces, so do the minor plumbing work necessary to make it work in any network namespace. Code inspired by net/core/sock_diag.c. Tested using kcapi-dgst from libkcapi [1]: Before: # unshare -n kcapi-dgst -c sha256 </dev/null | wc -c libkcapi - Error: Netlink error: sendmsg failed libkcapi - Error: Netlink error: sendmsg failed libkcapi - Error: NETLINK_CRYPTO: cannot obtain cipher information for hmac(sha512) (is required crypto_user.c patch missing? see documentation) 0 After: # unshare -n kcapi-dgst -c sha256 </dev/null | wc -c 32 [1] https://github.com/smuellerDD/libkcapi Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-26{nl,mac}80211: fix interface combinations on crypto controlled devicesManikanta Pubbisetty
Commit 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices") has introduced a change which allows 4addr operation on crypto controlled devices (ex: ath10k). This change has inadvertently impacted the interface combinations logic on such devices. General rule is that software interfaces like AP/VLAN should not be listed under supported interface combinations and should not be considered during validation of these combinations; because of the aforementioned change, AP/VLAN interfaces(if present) will be checked against interfaces supported by the device and blocks valid interface combinations. Consider a case where an AP and AP/VLAN are up and running; when a second AP device is brought up on the same physical device, this AP will be checked against the AP/VLAN interface (which will not be part of supported interface combinations of the device) and blocks second AP to come up. Add a new API cfg80211_iftype_allowed() to fix the problem, this API works for all devices with/without SW crypto control. Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org> Fixes: 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices") Link: https://lore.kernel.org/r/1563779690-9716-1-git-send-email-mpubbise@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: add xmit rate to struct ieee80211_tx_statusJohn Crispin
Right now struct ieee80211_tx_rate cannot hold HE rates. Lets use struct ieee80211_tx_status instead. This will also make the code future-proof for when we have EHT. Signed-off-by: John Crispin <john@phrozen.org> Link: https://lore.kernel.org/r/20190714154419.11854-2-john@phrozen.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: AMPDU handling for rekeys with Extended Key IDAlexander Wetzel
Extended Key ID allows A-MPDU sessions while rekeying as long as each A-MPDU aggregates only MPDUs with one keyid together. Drivers able to segregate MPDUs accordingly can tell mac80211 to not stop A-MPDU sessions when rekeying by setting the new flag IEEE80211_HW_AMPDU_KEYBORDER_SUPPORT. Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> Link: https://lore.kernel.org/r/20190629195015.19680-3-alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: Simplify Extended Key ID APIAlexander Wetzel
1) Drop IEEE80211_HW_EXT_KEY_ID_NATIVE and let drivers directly set the NL80211_EXT_FEATURE_EXT_KEY_ID flag. 2) Drop IEEE80211_HW_NO_AMPDU_KEYBORDER_SUPPORT and simply assume all drivers are unable to handle A-MPDU key borders. The new Extended Key ID API now requires all mac80211 drivers to set NL80211_EXT_FEATURE_EXT_KEY_ID when they implement set_key() and can handle Extended Key ID. For drivers not providing set_key() mac80211 itself enables Extended Key ID support, using the internal SW crypto services. Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> Link: https://lore.kernel.org/r/20190629195015.19680-2-alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: add tx dequeue function for process contextErik Stromdahl
Since ieee80211_tx_dequeue() must not be called with softirqs enabled (i.e. from process context without proper disable of bottom halves), we add a wrapper that disables bottom halves before calling ieee80211_tx_dequeue() The new function is named ieee80211_tx_dequeue_ni() just as all other from-process-context versions found in mac80211. The documentation of ieee80211_tx_dequeue() is also updated so it mentions that the function should not be called from process context. Signed-off-by: Erik Stromdahl <erik.stromdahl@gmail.com> Link: https://lore.kernel.org/r/20190617200140.6189-1-erik.stromdahl@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: remove unused and unneeded remove_sta_debugfs callbackGreg Kroah-Hartman
The remove_sta_debugfs callback in struct rate_control_ops is no longer used by any driver, as there is no need for it (the debugfs directory is already removed recursivly by the mac80211 core.) Because no one needs it, just remove it to keep anyone else from accidentally using it in the future. Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20190612142658.12792-5-gregkh@linuxfoundation.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26mac80211: pass the vif to cancel_remain_on_channelEmmanuel Grumbach
This low level driver can find it useful to get the vif when a remain on channel session is cancelled. iwlwifi will need this soon. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://lore.kernel.org/r/20190723180001.5828-1-emmanuel.grumbach@intel.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-07-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix segfault in libbpf, from Andrii. 2) fix gso_segs access, from Eric. 3) tls/sockmap fixes, from Jakub and John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-23net: sched: include mpls actions in hardware intermediate representationJohn Hurley
A recent addition to TC actions is the ability to manipulate the MPLS headers on packets. In preparation to offload such actions to hardware, update the IR code to accept and prepare the new actions. Note that no driver currently impliments the MPLS dec_ttl action so this is not included. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22net-ipv6-ndisc: add support for RFC7710 RA Captive Portal IdentifierMaciej Żenczykowski
This is trivial since we already have support for the entirely identical (from the kernel's point of view) RDNSS and DNSSL that also contain opaque data that needs to be passed down to userspace. As specified in RFC7710, Captive Portal option contains a URL. 8-bit identifier of the option type as assigned by the IANA is 37. This option should also be treated as userland. Hence, treat ND option 37 as userland (Captive Portal support) See: https://tools.ietf.org/html/rfc7710 https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml Fixes: e35f30c131a56 Signed-off-by: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: Remin Nguyen Van <reminv@google.com> Cc: Alexey I. Froloff <raorn@raorn.name> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22bpf: sockmap/tls, close can race with map freeJohn Fastabend
When a map free is called and in parallel a socket is closed we have two paths that can potentially reset the socket prot ops, the bpf close() path and the map free path. This creates a problem with which prot ops should be used from the socket closed side. If the map_free side completes first then we want to call the original lowest level ops. However, if the tls path runs first we want to call the sockmap ops. Additionally there was no locking around prot updates in TLS code paths so the prot ops could be changed multiple times once from TLS path and again from sockmap side potentially leaving ops pointed at either TLS or sockmap when psock and/or tls context have already been destroyed. To fix this race first only update ops inside callback lock so that TLS, sockmap and lowest level all agree on prot state. Second and a ULP callback update() so that lower layers can inform the upper layer when they are being removed allowing the upper layer to reset prot ops. This gets us close to allowing sockmap and tls to be stacked in arbitrary order but will save that patch for *next trees. v4: - make sure we don't free things for device; - remove the checks which swap the callbacks back only if TLS is at the top. Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: fix transition through disconnect with closeJohn Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE state via tcp_disconnect() without actually calling tcp_close which would then call the tls close callback. Because of this a user could disconnect a socket then put it in a LISTEN state which would break our assumptions about sockets always being ESTABLISHED state. More directly because close() can call unhash() and unhash is implemented by sockmap if a sockmap socket has TLS enabled we can incorrectly destroy the psock from unhash() and then call its close handler again. But because the psock (sockmap socket representation) is already destroyed we call close handler in sk->prot. However, in some cases (TLS BASE/BASE case) this will still point at the sockmap close handler resulting in a circular call and crash reported by syzbot. To fix both above issues implement the unhash() routine for TLS. v4: - add note about tls offload still needing the fix; - move sk_proto to the cold cache line; - split TX context free into "release" and "free", otherwise the GC work itself is in already freed memory; - more TX before RX for consistency; - reuse tls_ctx_free(); - schedule the GC work after we're done with context to avoid UAF; - don't set the unhash in all modes, all modes "inherit" TLS_BASE's callbacks anyway; - disable the unhash hook for TLS_HW. Fixes: 3c4d7559159bf ("tls: kernel TLS support") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: remove sock unlock/lock around strp_done()John Fastabend
The tls close() callback currently drops the sock lock to call strp_done(). Split up the RX cleanup into stopping the strparser and releasing most resources, syncing strparser and finally freeing the context. To avoid the need for a strp_done() call on the cleanup path of device offload make sure we don't arm the strparser until we are sure init will be successful. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: remove close callback sock unlock/lock around TX work flushJohn Fastabend
The tls close() callback currently drops the sock lock, makes a cancel_delayed_work_sync() call, and then relocks the sock. By restructuring the code we can avoid droping lock and then reclaiming it. To simplify this we do the following, tls_sk_proto_close set_bit(CLOSING) set_bit(SCHEDULE) cancel_delay_work_sync() <- cancel workqueue lock_sock(sk) ... release_sock(sk) strp_done() Setting the CLOSING bit prevents the SCHEDULE bit from being cleared by any workqueue items e.g. if one happens to be scheduled and run between when we set SCHEDULE bit and cancel work. Then because SCHEDULE bit is set now no new work will be scheduled. Tested with net selftests and bpf selftests. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: don't arm strparser immediately in tls_set_sw_offload()Jakub Kicinski
In tls_set_device_offload_rx() we prepare the software context for RX fallback and proceed to add the connection to the device. Unfortunately, software context prep includes arming strparser so in case of a later error we have to release the socket lock to call strp_done(). In preparation for not releasing the socket lock half way through callbacks move arming strparser into a separate function. Following patches will make use of that. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-21tcp: be more careful in tcp_fragment()Eric Dumazet
Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented. We should allow these flows to make progress. This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit. It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15 Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like : static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk); return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); } Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Cc: Jonathan Looney <jtl@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-20nl80211: fix VENDOR_CMD_RAW_DATAJohannes Berg
Since ERR_PTR() is an inline, not a macro, just open-code it here so it's usable as an initializer, fixing the build in brcmfmac. Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com> Fixes: 901bb9891855 ("nl80211: require and validate vendor command policy") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-19net: flow_offload: add flow_block structure and use itPablo Neira Ayuso
This object stores the flow block callbacks that are attached to this block. Update flow_block_cb_lookup() to take this new object. This patch restores the block sharing feature. Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_tPablo Neira Ayuso
Rename this type definition and adapt users. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19net: flow_offload: remove netns parameter from flow_block_cb_alloc()Pablo Neira Ayuso
No need to annotate the netns on the flow block callback object, flow_block_cb_is_busy() already checks for used blocks. Fixes: d63db30c8537 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix a deadlock when module is requested via netlink_bind() in nfnetlink, from Florian Westphal. 2) Fix ipt_rpfilter and ip6t_rpfilter with VRF, from Miaohe Lin. 3) Skip master comparison in SIP helper to fix expectation clash under two valid scenarios, from xiao ruizhu. 4) Remove obsolete comments in nf_conntrack codebase, from Yonatan Goldschmidt. 5) Fix redirect extension module autoload, from Christian Hesse. 6) Fix incorrect mssg option sent to client in synproxy, from Fernando Fernandez. 7) Fix incorrect window calculations in TCP conntrack, from Florian Westphal. 8) Don't bail out when updating basechain policy due to recent offload works, also from Florian. 9) Allow symhash to use modulus 1 as other hash extensions do, from Laura.Garcia. 10) Missing NAT chain module autoload for the inet family, from Phil Sutter. 11) Fix missing adjustment of TCP RST packet in synproxy, from Fernando Fernandez. 12) Skip EAGAIN path when nft_meta_bridge is built-in or not selected. 13) Conntrack bridge does not depend on nf_tables_bridge. 14) Turn NF_TABLES_BRIDGE into tristate to fix possible link break of nft_meta_bridge, from Arnd Bergmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17xfrm interface: fix management of phydevNicolas Dichtel
With the current implementation, phydev cannot be removed: $ ip link add dummy type dummy $ ip link add xfrm1 type xfrm dev dummy if_id 1 $ ip l d dummy kernel:[77938.465445] unregister_netdevice: waiting for dummy to become free. Usage count = 1 Manage it like in ip tunnels, ie just keep the ifindex. Not that the side effect, is that the phydev is now optional. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Julien Floret <julien.floret@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>