diff options
author | Willem de Bruijn <willemb@google.com> | 2017-09-01 12:01:41 -0400 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2017-09-01 10:39:35 -0700 |
commit | cc8889ae8298ebfc6bbf52ad98fe3b5afdf4ae70 (patch) | |
tree | ecc15bc075f9f8a534274f240f9823fb97d13f7f /Documentation/networking | |
parent | 9df59055ed8e2f817149e786b2a52bc17832d84e (diff) |
doc: document MSG_ZEROCOPY
Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.
Changes
v1 -> v2
- change email discussion URL format
- clarify that u32 counter is per-syscall, unsigned and
wraps after UINT_MAX calls
- describe errno on send failure specific to MSG_ZEROCOPY
- a few very minor rewordings
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/msg_zerocopy.rst | 257 |
1 files changed, 257 insertions, 0 deletions
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst new file mode 100644 index 000000000000..77f6d7e25cfd --- /dev/null +++ b/Documentation/networking/msg_zerocopy.rst @@ -0,0 +1,257 @@ + +============ +MSG_ZEROCOPY +============ + +Intro +===== + +The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. +The feature is currently implemented for TCP sockets. + + +Opportunity and Caveats +----------------------- + +Copying large buffers between user process and kernel can be +expensive. Linux supports various interfaces that eschew copying, +such as sendpage and splice. The MSG_ZEROCOPY flag extends the +underlying copy avoidance mechanism to common socket send calls. + +Copy avoidance is not a free lunch. As implemented, with page pinning, +it replaces per byte copy cost with page accounting and completion +notification overhead. As a result, MSG_ZEROCOPY is generally only +effective at writes over around 10 KB. + +Page pinning also changes system call semantics. It temporarily shares +the buffer between process and network stack. Unlike with copying, the +process cannot immediately overwrite the buffer after system call +return without possibly modifying the data in flight. Kernel integrity +is not affected, but a buggy program can possibly corrupt its own data +stream. + +The kernel returns a notification when it is safe to modify data. +Converting an existing application to MSG_ZEROCOPY is not always as +trivial as just passing the flag, then. + + +More Info +--------- + +Much of this document was derived from a longer paper presented at +netdev 2.1. For more in-depth information see that paper and talk, +the excellent reporting over at LWN.net or read the original code. + + paper, slides, video + https://netdevconf.org/2.1/session.html?debruijn + + LWN article + https://lwn.net/Articles/726917/ + + patchset + [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY + http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com + + +Interface +========= + +Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy +avoidance, but not the only one. + +Socket Setup +------------ + +The kernel is permissive when applications pass undefined flags to the +send system call. By default it simply ignores these. To avoid enabling +copy avoidance mode for legacy processes that accidentally already pass +this flag, a process must first signal intent by setting a socket option: + +:: + + if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) + error(1, errno, "setsockopt zerocopy"); + + +Transmission +------------ + +The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. +Pass the new flag. + +:: + + ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY); + +A zerocopy failure will return -1 with errno ENOBUFS. This happens if +the socket option was not set, the socket exceeds its optmem limit or +the user exceeds its ulimit on locked pages. + + +Mixing copy avoidance and copying +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Many workloads have a mixture of large and small buffers. Because copy +avoidance is more expensive than copying for small packets, the +feature is implemented as a flag. It is safe to mix calls with the flag +with those without. + + +Notifications +------------- + +The kernel has to notify the process when it is safe to reuse a +previously passed buffer. It queues completion notifications on the +socket error queue, akin to the transmit timestamping interface. + +The notification itself is a simple scalar value. Each socket +maintains an internal unsigned 32-bit counter. Each send call with +MSG_ZEROCOPY that successfully sends data increments the counter. The +counter is not incremented on failure or if called with length zero. +The counter counts system call invocations, not bytes. It wraps after +UINT_MAX calls. + + +Notification Reception +~~~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates the API. In the simplest case, each +send syscall is followed by a poll and recvmsg on the error queue. + +Reading from the error queue is always a non-blocking operation. The +poll call is there to block until an error is outstanding. It will set +POLLERR in its output flags. That flag does not have to be set in the +events field. Errors are signaled unconditionally. + +:: + + pfd.fd = fd; + pfd.events = 0; + if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0) + error(1, errno, "poll"); + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret == -1) + error(1, errno, "recvmsg"); + + read_notification(msg); + +The example is for demonstration purpose only. In practice, it is more +efficient to not wait for notifications, but read without blocking +every couple of send calls. + +Notifications can be processed out of order with other operations on +the socket. A socket that has an error queued would normally block +other operations until the error is read. Zerocopy notifications have +a zero error code, however, to not block send and recv calls. + + +Notification Batching +~~~~~~~~~~~~~~~~~~~~~ + +Multiple outstanding packets can be read at once using the recvmmsg +call. This is often not needed. In each message the kernel returns not +a single value, but a range. It coalesces consecutive notifications +while one is outstanding for reception on the error queue. + +When a new notification is about to be queued, it checks whether the +new value extends the range of the notification at the tail of the +queue. If so, it drops the new notification packet and instead increases +the range upper value of the outstanding notification. + +For protocols that acknowledge data in-order, like TCP, each +notification can be squashed into the previous one, so that no more +than one notification is outstanding at any one point. + +Ordered delivery is the common case, but not guaranteed. Notifications +may arrive out of order on retransmission and socket teardown. + + +Notification Parsing +~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates how to parse the control message: the +read_notification() call in the previous snippet. A notification +is encoded in the standard error format, sock_extended_err. + +The level and type fields in the control data are protocol family +specific, IP_RECVERR or IPV6_RECVERR. + +Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, +as explained before, to avoid blocking read and write system calls on +the socket. + +The 32-bit notification range is encoded as [ee_info, ee_data]. This +range is inclusive. Other fields in the struct must be treated as +undefined, bar for ee_code, as discussed below. + +:: + + struct sock_extended_err *serr; + struct cmsghdr *cm; + + cm = CMSG_FIRSTHDR(msg); + if (cm->cmsg_level != SOL_IP && + cm->cmsg_type != IP_RECVERR) + error(1, 0, "cmsg"); + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != 0 || + serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "serr"); + + printf("completed: %u..%u\n", serr->ee_info, serr->ee_data); + + +Deferred copies +~~~~~~~~~~~~~~~ + +Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy +avoidance, and a contract that the kernel will queue a completion +notification. It is not a guarantee that the copy is elided. + +Copy avoidance is not always feasible. Devices that do not support +scatter-gather I/O cannot send packets made up of kernel generated +protocol headers plus zerocopy user data. A packet may need to be +converted to a private copy of data deep in the stack, say to compute +a checksum. + +In all these cases, the kernel returns a completion notification when +it releases its hold on the shared pages. That notification may arrive +before the (copied) data is fully transmitted. A zerocopy completion +notification is not a transmit completion notification, therefore. + +Deferred copies can be more expensive than a copy immediately in the +system call, if the data is no longer warm in the cache. The process +also incurs notification processing cost for no benefit. For this +reason, the kernel signals if data was completed with a copy, by +setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. +A process may use this signal to stop passing flag MSG_ZEROCOPY on +subsequent requests on the same socket. + + +Implementation +============== + +Loopback +-------- + +Data sent to local sockets can be queued indefinitely if the receive +process does not read its socket. Unbound notification latency is not +acceptable. For this reason all packets generated with MSG_ZEROCOPY +that are looped to a local socket will incur a deferred copy. This +includes looping onto packet sockets (e.g., tcpdump) and tun devices. + + +Testing +======= + +More realistic example code can be found in the kernel source under +tools/testing/selftests/net/msg_zerocopy.c. + +Be cognizant of the loopback constraint. The test can be run between +a pair of hosts. But if run between a local pair of processes, for +instance when run with msg_zerocopy.sh between a veth pair across +namespaces, the test will not show any improvement. For testing, the +loopback restriction can be temporarily relaxed by making +skb_orphan_frags_rx identical to skb_orphan_frags. |