summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2015-08-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) A couple of cleanups for the netfilter core hook from Eric Biederman. 2) Net namespace hook registration, also from Eric. This adds a dependency with the rtnl_lock. This should be fine by now but we have to keep an eye on this because if we ever get the per-subsys nfnl_lock before rtnl we have may problems in the future. But we have room to remove this in the future by propagating the complexity to the clients, by registering hooks for the init netns functions. 3) Update nf_tables to use the new net namespace hook infrastructure, also from Eric. 4) Three patches to refine and to address problems from the new net namespace hook infrastructure. 5) Switch to alternate jumpstack in xtables iff the packet is reentering. This only applies to a very special case, the TEE target, but Eric Dumazet reports that this is slowing down things for everyone else. So let's only switch to the alternate jumpstack if the tee target is in used through a static key. This batch also comes with offline precalculation of the jumpstack based on the callchain depth. From Florian Westphal. 6) Minimal SCTP multihoming support for our conntrack helper, from Michal Kubecek. 7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian Westphal. 8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler. 9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: arch/s390/net/bpf_jit_comp.c drivers/net/ethernet/ti/netcp_ethss.c net/bridge/br_multicast.c net/ipv4/ip_fragment.c All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31ipv6: Enable auto flow labels by defaultTom Herbert
Initialize auto_flowlabels to one. This enables automatic flow labels, individual socket may disable them using the IPV6_AUTOFLOWLABEL socket option. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31ipv6: Implement different admin modes for automatic flow labelsTom Herbert
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for automatic flow labels generation. There are four modes: 0: flow labels are disabled 1: flow labels are enabled, sockets can opt-out 2: flow labels are allowed, sockets can opt-in 3: flow labels are enabled and enforced, no opt-out for sockets np->autoflowlabel is initialized according to the sysctl value. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabelTom Herbert
We can't call skb_get_hash here since the packet is not complete to do flow_dissector. Create hash based on flowi6 instead. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argumentRoopa Prabhu
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup for use cases where sk is not available (like mpls). sk appears to be needed to get the namespace 'net' and is optional otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup to take net argument. sk remains optional. All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified to pass net. I have modified them to use already available 'net' in the scope of the call. I can change them to sock_net(sk) to avoid any unintended change in behaviour if sock namespace is different. They dont seem to be from code inspection. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31bpf: add helpers to access tunnel metadataAlexei Starovoitov
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata: bpf_skb_[gs]et_tunnel_key(skb, key, size, flags) skb: pointer to skb key: pointer to 'struct bpf_tunnel_key' size: size of 'struct bpf_tunnel_key' flags: room for future extensions First eBPF program that uses these helpers will allocate per_cpu metadata_dst structures that will be used on TX. On RX metadata_dst is allocated by tunnel driver. Typical usage for TX: struct bpf_tunnel_key tkey; ... populate tkey ... bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0); bpf_clone_redirect(skb, vxlan_dev_ifindex, 0); RX: struct bpf_tunnel_key tkey = {}; bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0); ... lookup or redirect based on tkey ... 'struct bpf_tunnel_key' will be extended in the future by adding elements to the end and the 'size' argument will indicate which fields are populated, thereby keeping backwards compatibility. The 'flags' argument may be used as well when the 'size' is not enough or to indicate completely different layout of bpf_tunnel_key. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-07-30 Here's a set of Bluetooth & 802.15.4 patches intended for the 4.3 kernel. - Cleanups & fixes to mac802154 - Refactoring of Intel Bluetooth HCI driver - Various coding style fixes to Bluetooth HCI drivers - Support for Intel Lightning Peak Bluetooth devices - Generic class code in interface descriptor in btusb to match more HW - Refactoring of Bluetooth HS code together with a new config option - Support for BCM4330B1 Broadcom UART controller Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30net: sched: fix refcount imbalance in actionsDaniel Borkmann
Since commit 55334a5db5cd ("net_sched: act: refuse to remove bound action outside"), we end up with a wrong reference count for a tc action. Test case 1: FOO="1,6 0 0 4294967295," BAR="1,6 0 0 4294967294," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \ action bpf bytecode "$FOO" tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 1 bind 1 tc actions replace action bpf bytecode "$BAR" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 1 ref 2 bind 1 tc actions replace action bpf bytecode "$FOO" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 3 bind 1 Test case 2: FOO="1,6 0 0 4294967295," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 2 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 3 bind 1 What happens is that in tcf_hash_check(), we check tcf_common for a given index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've found an existing action. Now there are the following cases: 1) We do a late binding of an action. In that case, we leave the tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init() handler. This is correctly handeled. 2) We replace the given action, or we try to add one without replacing and find out that the action at a specific index already exists (thus, we go out with error in that case). In case of 2), we have to undo the reference count increase from tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to do so because of the 'tcfc_bindcnt > 0' check which bails out early with an -EPERM error. Now, while commit 55334a5db5cd prevents 'tc actions del action ...' on an already classifier-bound action to drop the reference count (which could then become negative, wrap around etc), this restriction only accounts for invocations outside a specific action's ->init() handler. One possible solution would be to add a flag thus we possibly trigger the -EPERM ony in situations where it is indeed relevant. After the patch, above test cases have correct reference count again. Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29lwtunnel: Make lwtun_encaps[] staticThomas Graf
Any external user should use the registration API instead of accessing this directly. Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29net: Recompute sk_txhash on negative routing adviceTom Herbert
When a connection is failing a transport protocol calls dst_negative_advice to try to get a better route. This patch includes changing the sk_txhash in that function. This provides a rudimentary method to try to find a different path in the network since sk_txhash affects ECMP on the local host and through the network (via flow labels or UDP source port in encapsulation). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29net: Set sk_txhash from a random numberTom Herbert
This patch creates sk_set_txhash and eliminates protocol specific inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a random number instead of performing flow dissection. sk_set_txash is also allowed to be called multiple times for the same socket, we'll need this when redoing the hash for negative routing advice. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27tcp: fix recv with flags MSG_WAITALL | MSG_PEEKSabrina Dubroca
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258 Reported-by: Enrico Scholz <rh-bugzilla@ensc.de> Reported-by: Dan Searle <dan@censornet.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27bonding: convert num_grat_arp to the new bonding option APINikolay Aleksandrov
num_grat_arp wasn't converted to the new bonding option API, so do this now and remove the specific sysfs store option in order to use the standard one. num_grat_arp is the same as num_unsol_na so add it as an alias with the same option settings. An important difference is the option name which is matched in bond_sysfs_store_option(). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Veaceslav Falico <vfalico@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27lwtunnel: change prototype of lwtunnel_state_get()Nicolas Dichtel
It saves some lines and simplify a bit the code when the state is returning by this function. It's also useful to handle a NULL entry. To avoid too long lines, I've also renamed lwtunnel_state_get() and lwtunnel_state_put() to lwtstate_get() and lwtstate_put(). CC: Thomas Graf <tgraf@suug.ch> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26inet: frags: remove INET_FRAG_EVICTED and use list_evictor for the testNikolay Aleksandrov
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags race conditions with the evictor and use a participation test for the evictor list, when we're at that point (after inet_frag_kill) in the timer there're 2 possible cases: 1. The evictor added the entry to its evictor list while the timer was waiting for the chainlock or 2. The timer unchained the entry and the evictor won't see it In both cases we should be able to see list_evictor correctly due to the sync on the chainlock. Joint work with Florian Westphal. Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26inet: frag: change *_frag_mem_limit functions to take netns_frags as argumentFlorian Westphal
Followup patch will call it after inet_frag_queue was freed, so q->net doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would). Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26inet: frag: don't re-use chainlist for evictorFlorian Westphal
commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Reported-by: Johan Schuijt <johan@transip.nl> Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains ten Netfilter/IPVS fixes, they are: 1) Address refcount leak when creating an expectation from the ctnetlink interface. 2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry Torokhov. 3) Resolve panic for unreachable route in IPVS with locally generated traffic in the output path, from Alex Gartrell. 4) Fix wrong source address in rare cases for tunneled traffic in IPVS, from Julian Anastasov. 5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian. 6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from Alex Gartrell. 7) Fix crash with IPVS sync protocol v0 and FTP, from Julian. 8) Reset sender cpu for forwarded traffic in IPVS, also from Julian. 9) Allocate template conntracks through kmalloc() to resolve netns dependency problems with the conntrack kmem_cache. 10) Fix zones with expectations that clash using the same tuple, from Joe Stringer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24ipv4: consider TOS in fib_select_defaultJulian Anastasov
fib_select_default considers alternative routes only when res->fi is for the first alias in res->fa_head. In the common case this can happen only when the initial lookup matches the first alias with highest TOS value. This prevents the alternative routes to require specific TOS. This patch solves the problem as follows: - routes that require specific TOS should be returned by fib_select_default only when TOS matches, as already done in fib_table_lookup. This rule implies that depending on the TOS we can have many different lists of alternative gateways and we have to keep the last used gateway (fa_default) in first alias for the TOS instead of using single tb_default value. - as the aliases are ordered by many keys (TOS desc, fib_priority asc), we restrict the possible results to routes with matching TOS and lowest metric (fib_priority) and routes that match any TOS, again with lowest metric. For example, packet with TOS 8 can not use gw3 (not lowest metric), gw4 (different TOS) and gw6 (not lowest metric), all other gateways can be used: tos 8 via gw1 metric 2 <--- res->fa_head and res->fi tos 8 via gw2 metric 2 tos 8 via gw3 metric 3 tos 4 via gw4 tos 0 via gw5 tos 0 via gw6 metric 1 Reported-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23Bluetooth: hci_check_conn_params() check proper rangeSeungyoun Ju
Slave latency range has been changed in Core Spec. 4.2 by Erratum 5419 of ESR08_V1.0.0. And it should be applied to Core Spec. 4.0 and 4.1. Before: connSlaveLatency <= ((connSupervisionTimeout / connIntervalMax) - 1) After: connSlaveLatency <= ((connSupervisionTimeout / (connIntervalMax*2)) - 1) This patch makes hci_check_conn_params() check the allowable slave latency range using the changed way. Signed-off-by: Seungyoun Ju <sy39.ju@samsung.com> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-07-23Bluetooth: __l2cap_wait_ack() add defensive timeoutDean Jenkins
Add a timeout to prevent the do while loop running in an infinite loop. This ensures that the channel will be instructed to close within 10 seconds so prevents l2cap_sock_shutdown() getting stuck forever. Returns -ENOLINK when the timeout is reached. The channel will be subequently closed and not all data will be ACK'ed. Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23Bluetooth: __l2cap_wait_ack() use msecs_to_jiffies()Dean Jenkins
Use msecs_to_jiffies() instead of using HZ so that it is easier to specify the time in milliseconds. Also add a #define L2CAP_WAIT_ACK_POLL_PERIOD to specify the 200ms polling period so that it is defined in a single place. Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23mac802154: do not export ieee802154_rx()Varka Bhadram
Right now there are no other users for ieee802154_rx() in kernel. So lets remove EXPORT_SYMBOL() for this. Also it moves the function prototype from global header file to local header file. Signed-off-by: Varka Bhadram <varkabhadram@gmail.com> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23cfg802154: add PM hooksVarka Bhadram
This patch help to implement suspend/resume in mac802154, these hooks will be run before the device is suspended and after it resumes. Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23ip_tunnel: Call ip_tunnel_core_init() from inet_init()Thomas Graf
Convert the module_init() to a invocation from inet_init() since ip_tunnel_core is part of the INET built-in. Fixes: 3093fbe7ff4 ("route: Per route IP tunnel metadata via lightweight tunnel") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/bridge/br_mdb.c br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22ip_tunnel: Provide tunnel metadata API for CONFIG_INET=nThomas Graf
Account for the configuration FIB_RULES=y && INET=n as FIB_RULES can be selected by IPV6 or DECNET without INET. Fixes: e7030878fc84 ("fib: Add fib rule match on tunnel id") Fixes: 3093fbe7ff4b ("route: Per route IP tunnel metadata via lightweight tunnel") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21net: #ifdefify sk_classid member of struct sockMathias Krause
The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is enabled. #ifdefify it to reduce the size of struct sock on 32 bit systems, at least. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21openvswitch: Use regular VXLAN net_device deviceThomas Graf
This gets rid of all OVS specific VXLAN code in the receive and transmit path by using a VXLAN net_device to represent the vport. Only a small shim layer remains which takes care of handling the VXLAN specific OVS Netlink configuration. Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb() since they are no longer needed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21vxlan: Factor out device configurationThomas Graf
This factors out the device configuration out of the RTNL newlink API which allows for in-kernel creation of VXLAN net_devices. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21fib: Add fib rule match on tunnel idThomas Graf
This add the ability to select a routing table based on the tunnel id which allows to maintain separate routing tables for each virtual tunnel network. ip rule add from all tunnel-id 100 lookup 100 ip rule add from all tunnel-id 200 lookup 200 A new static key controls the collection of metadata at tunnel level upon demand. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21route: Per route IP tunnel metadata via lightweight tunnelThomas Graf
This introduces a new IP tunnel lightweight tunnel type which allows to specify IP tunnel instructions per route. Only IPv4 is supported at this point. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21route: Extend flow representation with tunnel keyThomas Graf
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to allow routes to match on tunnel metadata. For now, the tunnel id is added to flowi_tunnel which allows for routes to be bound to specific virtual tunnels. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21vxlan: Flow based tunnelingThomas Graf
Allows putting a VXLAN device into a new flow-based mode in which skbs with a ip_tunnel_info dst metadata attached will be encapsulated according to the instructions stored in there with the VXLAN device defaults taken into consideration. Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is set, the packet processing will populate a ip_tunnel_info struct for each packet received and attach it to the skb using the new metadata dst. The metadata structure will contain the outer header and tunnel header fields which have been stripped off. Layers further up in the stack such as routing, tc or netfitler can later match on these fields and perform forwarding. It is the responsibility of upper layers to ensure that the flag is set if the metadata is needed. The flag limits the additional cost of metadata collecting based on demand. This prepares the VXLAN device to be steered by the routing and other subsystems which allows to support encapsulation for a large number of tunnel endpoints and tunnel ids through a single net_device which improves the scalability. It also allows for OVS to leverage this mode which in turn allows for the removal of the OVS specific VXLAN code. Because the skb is currently scrubed in vxlan_rcv(), the attachment of the new dst metadata is postponed until after scrubing which requires the temporary addition of a new member to vxlan_metadata. This member is removed again in a later commit after the indirect VXLAN receive API has been removed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21dst: Metadata destinationsThomas Graf
Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb->dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel genericThomas Graf
Rename the tunnel metadata data structures currently internal to OVS and make them generic for use by all IP tunnels. Both structures are kernel internal and will stay that way. Their members are exposed to user space through individual Netlink attributes by OVS. It will therefore be possible to extend/modify these structures without affecting user ABI. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21mpls: ip tunnel supportRoopa Prabhu
This implementation uses lwtunnel infrastructure to register hooks for mpls tunnel encaps. It picks cues from iptunnel_encaps infrastructure and previous mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21lwtunnel: support dst output redirect functionRoopa Prabhu
This patch introduces lwtunnel_output function to call corresponding lwtunnels output function to xmit the packet. It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and ipv6 respectively today. But this is subject to change when lwtstate will reside in dst or dst_metadata (as per upstream discussions). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ipv6: support for fib route lwtunnel encap attributesRoopa Prabhu
This patch adds support in ipv6 fib functions to parse Netlink RTA encap attributes and attach encap state data to rt6_info. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ipv4: support for fib route lwtunnel encap attributesRoopa Prabhu
This patch adds support in ipv4 fib functions to parse user provided encap attributes and attach encap state data to fib_nh and rtable. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21lwtunnel: infrastructure for handling light weight tunnels like mplsRoopa Prabhu
Provides infrastructure to parse/dump/store encap information for light weight tunnels like mpls. Encap information for such tunnels is associated with fib routes. This infrastructure is based on previous suggestions from Eric Biederman to follow the xfrm infrastructure. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21Merge tag 'mac80211-for-davem-2015-07-17' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Some fixes for the current cycle: 1. Arik introduced an rtnl-locked regulatory API to be able to differentiate between place do/don't have the RTNL; this fixes missing locking in some of the code paths 2. Two small mesh bugfixes from Bob, one to avoid treating a certain malformed over-the-air frame and one to avoid sending a garbage field over the air. 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya. 4. A fix for a powersave vs. aggregation teardown race, from Michal. 5. Thomas reduced the loglevel of CRDA messages to avoid spamming the kernel log with mostly irrelevant information. 6. Tom fixed a dangling debugfs directory pointer that could cause crashes if subsequent addition of the same interface to debugfs failed for some reason. 7. A fix from myself for a list corruption issue in mac80211 during combined interface shutdown/removal - shut down interfaces first and only then remove them to avoid that. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20switchdev: add offload_fwd_mark generator helperScott Feldman
skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be unique for device and may even be unique for a sub-set of ports within device, so add switchdev helper function to generate unique marks based on port's switch ID and group_ifindex. group_ifindex would typically be the container dev's ifindex, such as the bridge's ifindex. The generator uses a global hash table to store offload_fwd_marks hashed by {switch ID, group_ifindex} key. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20cls_cgroup: factor out classid retrievalDaniel Borkmann
Split out retrieving the cgroups net_cls classid retrieval into its own function, so that it can be reused later on from other parts of the traffic control subsystem. If there's no skb->sk, then the small helper returns 0 as well, which in cls_cgroup terms means 'could not classify'. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20netfilter: fix netns dependencies with conntrack templatesPablo Neira Ayuso
Quoting Daniel Borkmann: "When adding connection tracking template rules to a netns, f.e. to configure netfilter zones, the kernel will endlessly busy-loop as soon as we try to delete the given netns in case there's at least one template present, which is problematic i.e. if there is such bravery that the priviledged user inside the netns is assumed untrusted. Minimal example: ip netns add foo ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1 ip netns del foo What happens is that when nf_ct_iterate_cleanup() is being called from nf_conntrack_cleanup_net_list() for a provided netns, we always end up with a net->ct.count > 0 and thus jump back to i_see_dead_people. We don't get a soft-lockup as we still have a schedule() point, but the serving CPU spins on 100% from that point onwards. Since templates are normally allocated with nf_conntrack_alloc(), we also bump net->ct.count. The issue why they are not yet nf_ct_put() is because the per netns .exit() handler from x_tables (which would eventually invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is called in the dependency chain at a *later* point in time than the per netns .exit() handler for the connection tracker. This is clearly a chicken'n'egg problem: after the connection tracker .exit() handler, we've teared down all the connection tracking infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be invoked at a later point in time during the netns cleanup, as that would lead to a use-after-free. At the same time, we cannot make x_tables depend on the connection tracker module, so that the xt_ct_tg_destroy() would be invoked earlier in the cleanup chain." Daniel confirms this has to do with the order in which modules are loaded or having compiled nf_conntrack as modules while x_tables built-in. So we have no guarantees regarding the order in which netns callbacks are executed. Fix this by allocating the templates through kmalloc() from the respective SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache. Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch is marked as unlikely since conntrack templates are rarely allocated and only from the configuration plane path. Note that templates are not kept in any list to avoid further dependencies with nf_conntrack anymore, thus, the tmpl larval list is removed. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Daniel Borkmann <daniel@iogearbox.net>
2015-07-17cfg80211: use RTNL locked reg_can_beacon for IR-relaxationArik Nemtsov
The RTNL is required to check for IR-relaxation conditions that allow more channels to beacon. Export an RTNL locked version of reg_can_beacon and use it where possible in AP/STA interface type flows, where IR-relaxation may be applicable. Fixes: 06f207fc5418 ("cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA") Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-15ipv6: lock socket in ip6_datagram_connect()Eric Dumazet
ip6_datagram_connect() is doing a lot of socket changes without socket being locked. This looks wrong, at least for udp_lib_rehash() which could corrupt lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15netfilter: Per network namespace netfilter hooks.Eric W. Biederman
- Add a new set of functions for registering and unregistering per network namespace hooks. - Modify the old global namespace hook functions to use the per network namespace hooks in their implementation, so their remains a single list that needs to be walked for any hook (this is important for keeping the hook priority working and for keeping the code walking the hooks simple). - Only allow registering the per netdevice hooks in the network namespace where the network device lives. - Dynamically allocate the structures in the per network namespace hook list in nf_register_net_hook, and unregister them in nf_unregister_net_hook. Dynamic allocate is required somewhere as the number of network namespaces are not fixed so we might as well allocate them in the registration function. The chain of registered hooks on any list is expected to be small so the cost of walking that list to find the entry we are unregistering should also be small. Performing the management of the dynamically allocated list entries in the registration and unregistration functions keeps the complexity from spreading. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-09ipv6: Nonlocal bindTom Herbert
Add support to allow non-local binds similar to how this was done for IPv4. Non-local binds are very useful in emulating the Internet in a box, etc. This add the ip_nonlocal_bind sysctl under ipv6. Testing: Set up nonlocal binding and receive routing on a host, e.g.: ip -6 rule add from ::/0 iif eth0 lookup 200 ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200 sysctl -w net.ipv6.ip_nonlocal_bind=1 Set up routing to 2001:0:0:1::/64 on peer to go to first host ping6 -I 2001:0:0:1::1 peer-address -- to verify Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>