diff options
| author | Alexei Starovoitov <ast@kernel.org> | 2018-05-03 16:20:12 -0700 |
|---|---|---|
| committer | Alexei Starovoitov <ast@kernel.org> | 2018-05-03 16:21:21 -0700 |
| commit | 08dbc7a66af2321661173c04d872eba44003cc13 (patch) | |
| tree | 5632a9676aa3bbe9d0b72d61d55cb92619d7d7d7 /include | |
| parent | 03f5781be2c7b7e728d724ac70ba10799cc710d7 (diff) | |
| parent | b4b8faa1ded7a3bb34db374c692a51cea29f9080 (diff) | |
Merge branch 'AF_XDP-initial-support'
Björn Töpel says:
====================
This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.
An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.
This new dedicated packet buffer area is call a UMEM. It consists of a
number of equally size frames and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.
A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
How is then packets distributed between these two XSK? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space. A fallback mode that works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.
There is a xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
for this:
ethtool -N p3p2 rx-flow-hash udp4 fn
ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
action 16
Running the rxdrop benchmark in XDP_DRV mode can then be done
using:
samples/bpf/xdpsock -i p3p2 -q 16 -r -N
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.
Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputing packets at full 40 Gbit/s line rate. The results are without
retpoline so that we can compare against previous numbers. With
retpoline, the AF_XDP numbers drop with between 10 - 15 percent.
AF_XDP performance 64 byte packets. Results from V2 in parenthesis.
Benchmark XDP_SKB XDP_DRV
rxdrop 2.9(3.0) 9.6(9.5)
txpush 2.6(2.5) NA*
l2fwd 1.9(1.9) 2.5(2.5) (TX using XDP_SKB in both cases)
AF_XDP performance 1500 byte packets:
Benchmark XDP_SKB XDP_DRV
rxdrop 2.1(2.2) 3.3(3.3)
l2fwd 1.4(1.4) 1.8(1.8) (TX using XDP_SKB in both cases)
* NA since we have no support for TX using the XDP_DRV infrastructure
in this patch set. This is for a future patch set since it involves
changes to the XDP NDOs. Some of this has been upstreamed by Jesper
Dangaard Brouer.
XDP performance on our system as a base line:
64 byte packets:
XDP stats CPU pps issue-pps
XDP-RX CPU 16 32.3(32.9)M 0
1500 byte packets:
XDP stats CPU pps issue-pps
XDP-RX CPU 16 3.3(3.3)M 0
Changes from V2:
* Fixed a race in XSKMAP map found by Will. The code has been
completely rearchitected and is now simpler, faster, and hopefully
also not racy. Please review and check if it holds.
If you would like to diff V2 against V3, you can find them here:
https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next
The structure of the patch set is as follows:
Patches 1-3: Basic socket and umem plumbing
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application
We based this patch set on bpf-next commit a3fe1f6f2ada ("tools:
bpftool: change time format for program 'loaded at:' information")
To do for this patch set:
* Syzkaller torture session being worked on
Post-series plan:
* Optimize performance
* Kernel selftest
* Kernel load module support of AF_XDP would be nice. Unclear how to
achieve this though since our XDP code depends on net/core.
* Support for AF_XDP sockets without an XPD program loaded. In this
case all the traffic on a queue should go up to the user space socket.
* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
XDP_PASS" for a tcpdump-like functionality.
* And of course getting to zero-copy support in small increments,
starting with TX then adding RX.
Thanks: Björn and Magnus
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Diffstat (limited to 'include')
| -rw-r--r-- | include/linux/bpf.h | 25 | ||||
| -rw-r--r-- | include/linux/bpf_types.h | 3 | ||||
| -rw-r--r-- | include/linux/filter.h | 2 | ||||
| -rw-r--r-- | include/linux/netdevice.h | 1 | ||||
| -rw-r--r-- | include/linux/socket.h | 5 | ||||
| -rw-r--r-- | include/net/xdp.h | 1 | ||||
| -rw-r--r-- | include/net/xdp_sock.h | 66 | ||||
| -rw-r--r-- | include/uapi/linux/bpf.h | 1 | ||||
| -rw-r--r-- | include/uapi/linux/if_xdp.h | 87 |
9 files changed, 189 insertions, 2 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c553f6f9c6b0..68ecdb4eea09 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -676,6 +676,31 @@ static inline int sock_map_prog(struct bpf_map *map, } #endif +#if defined(CONFIG_XDP_SOCKETS) +struct xdp_sock; +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key); +int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, + struct xdp_sock *xs); +void __xsk_map_flush(struct bpf_map *map); +#else +struct xdp_sock; +static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, + u32 key) +{ + return NULL; +} + +static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, + struct xdp_sock *xs) +{ + return -EOPNOTSUPP; +} + +static inline void __xsk_map_flush(struct bpf_map *map) +{ +} +#endif + /* verifier prototypes for helper functions called from eBPF programs */ extern const struct bpf_func_proto bpf_map_lookup_elem_proto; extern const struct bpf_func_proto bpf_map_update_elem_proto; diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 2b28fcf6f6ae..d7df1b323082 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -49,4 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops) +#if defined(CONFIG_XDP_SOCKETS) +BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops) +#endif #endif diff --git a/include/linux/filter.h b/include/linux/filter.h index 64899c04c1a6..b7f81e3a70cb 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -760,7 +760,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * This does not appear to be a real limitation for existing software. */ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, - struct bpf_prog *prog); + struct xdp_buff *xdp, struct bpf_prog *prog); int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *prog); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 366c32891158..a30435118530 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2486,6 +2486,7 @@ void dev_disable_lro(struct net_device *dev); int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb); int dev_queue_xmit(struct sk_buff *skb); int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv); +int dev_direct_xmit(struct sk_buff *skb, u16 queue_id); int register_netdevice(struct net_device *dev); void unregister_netdevice_queue(struct net_device *dev, struct list_head *head); void unregister_netdevice_many(struct list_head *head); diff --git a/include/linux/socket.h b/include/linux/socket.h index ea50f4a65816..7ed4713d5337 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -207,8 +207,9 @@ struct ucred { * PF_SMC protocol family that * reuses AF_INET address family */ +#define AF_XDP 44 /* XDP sockets */ -#define AF_MAX 44 /* For now.. */ +#define AF_MAX 45 /* For now.. */ /* Protocol families, same as address families. */ #define PF_UNSPEC AF_UNSPEC @@ -257,6 +258,7 @@ struct ucred { #define PF_KCM AF_KCM #define PF_QIPCRTR AF_QIPCRTR #define PF_SMC AF_SMC +#define PF_XDP AF_XDP #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ @@ -338,6 +340,7 @@ struct ucred { #define SOL_NFC 280 #define SOL_KCM 281 #define SOL_TLS 282 +#define SOL_XDP 283 /* IPX options */ #define IPX_TYPE 1 diff --git a/include/net/xdp.h b/include/net/xdp.h index 137ad5f9f40f..0b689cf561c7 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp) } void xdp_return_frame(struct xdp_frame *xdpf); +void xdp_return_buff(struct xdp_buff *xdp); int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h new file mode 100644 index 000000000000..185f4928fbda --- /dev/null +++ b/include/net/xdp_sock.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0 + * AF_XDP internal functions + * Copyright(c) 2018 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#ifndef _LINUX_XDP_SOCK_H +#define _LINUX_XDP_SOCK_H + +#include <linux/mutex.h> +#include <net/sock.h> + +struct net_device; +struct xsk_queue; +struct xdp_umem; + +struct xdp_sock { + /* struct sock must be the first member of struct xdp_sock */ + struct sock sk; + struct xsk_queue *rx; + struct net_device *dev; + struct xdp_umem *umem; + struct list_head flush_node; + u16 queue_id; + struct xsk_queue *tx ____cacheline_aligned_in_smp; + /* Protects multiple processes in the control path */ + struct mutex mutex; + u64 rx_dropped; +}; + +struct xdp_buff; +#ifdef CONFIG_XDP_SOCKETS +int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp); +int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp); +void xsk_flush(struct xdp_sock *xs); +bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs); +#else +static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + return -ENOTSUPP; +} + +static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + return -ENOTSUPP; +} + +static inline void xsk_flush(struct xdp_sock *xs) +{ +} + +static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) +{ + return false; +} +#endif /* CONFIG_XDP_SOCKETS */ + +#endif /* _LINUX_XDP_SOCK_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 8daef7326bb7..a3a495052511 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -116,6 +116,7 @@ enum bpf_map_type { BPF_MAP_TYPE_DEVMAP, BPF_MAP_TYPE_SOCKMAP, BPF_MAP_TYPE_CPUMAP, + BPF_MAP_TYPE_XSKMAP, }; enum bpf_prog_type { diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h new file mode 100644 index 000000000000..77b88c4efe98 --- /dev/null +++ b/include/uapi/linux/if_xdp.h @@ -0,0 +1,87 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note + * + * if_xdp: XDP socket user-space interface + * Copyright(c) 2018 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * Author(s): Björn Töpel <bjorn.topel@intel.com> + * Magnus Karlsson <magnus.karlsson@intel.com> + */ + +#ifndef _LINUX_IF_XDP_H +#define _LINUX_IF_XDP_H + +#include <linux/types.h> + +/* Options for the sxdp_flags field */ +#define XDP_SHARED_UMEM 1 + +struct sockaddr_xdp { + __u16 sxdp_family; + __u32 sxdp_ifindex; + __u32 sxdp_queue_id; + __u32 sxdp_shared_umem_fd; + __u16 sxdp_flags; +}; + +/* XDP socket options */ +#define XDP_RX_RING 1 +#define XDP_TX_RING 2 +#define XDP_UMEM_REG 3 +#define XDP_UMEM_FILL_RING 4 +#define XDP_UMEM_COMPLETION_RING 5 +#define XDP_STATISTICS 6 + +struct xdp_umem_reg { + __u64 addr; /* Start of packet data area */ + __u64 len; /* Length of packet data area */ + __u32 frame_size; /* Frame size */ + __u32 frame_headroom; /* Frame head room */ +}; + +struct xdp_statistics { + __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ + __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ + __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ +}; + +/* Pgoff for mmaping the rings */ +#define XDP_PGOFF_RX_RING 0 +#define XDP_PGOFF_TX_RING 0x80000000 +#define XDP_UMEM_PGOFF_FILL_RING 0x100000000 +#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000 + +struct xdp_desc { + __u32 idx; + __u32 len; + __u16 offset; + __u8 flags; + __u8 padding[5]; +}; + +struct xdp_ring { + __u32 producer __attribute__((aligned(64))); + __u32 consumer __attribute__((aligned(64))); +}; + +/* Used for the RX and TX queues for packets */ +struct xdp_rxtx_ring { + struct xdp_ring ptrs; + struct xdp_desc desc[0] __attribute__((aligned(64))); +}; + +/* Used for the fill and completion queues for buffers */ +struct xdp_umem_ring { + struct xdp_ring ptrs; + __u32 desc[0] __attribute__((aligned(64))); +}; + +#endif /* _LINUX_IF_XDP_H */ |
