summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/batman-adv.txt3
-rw-r--r--Documentation/networking/ip-sysctl.txt56
-rw-r--r--Documentation/networking/packet_mmap.txt246
-rw-r--r--Documentation/networking/stmmac.txt28
4 files changed, 278 insertions, 55 deletions
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
index a173d2a879f5..c1d82047a4b1 100644
--- a/Documentation/networking/batman-adv.txt
+++ b/Documentation/networking/batman-adv.txt
@@ -203,7 +203,8 @@ abled during run time. Following log_levels are defined:
2 - Enable messages related to route added / changed / deleted
4 - Enable messages related to translation table operations
8 - Enable messages related to bridge loop avoidance
-15 - enable all messages
+16 - Enable messaged related to DAT, ARP snooping and parsing
+31 - Enable all messages
The debug output can be changed at runtime using the file
/sys/class/net/bat0/mesh/log_level. e.g.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7fc10724948..dd52d516cb89 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -30,16 +30,24 @@ neigh/default/gc_thresh3 - INTEGER
Maximum number of neighbor entries allowed. Increase this
when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
+ Default: 1024
neigh/default/unres_qlen_bytes - INTEGER
The maximum number of bytes which may be used by packets
queued for each unresolved address by other network layers.
(added in linux 3.3)
+ Seting negative value is meaningless and will retrun error.
+ Default: 65536 Bytes(64KB)
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
unresolved address by other network layers.
(deprecated in linux 3.3) : use unres_qlen_bytes instead.
+ Prior to linux 3.3, the default value is 3 which may cause
+ unexpected packet loss. The current default value is calculated
+ according to default value of unres_qlen_bytes and true size of
+ packet.
+ Default: 31
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
@@ -199,15 +207,16 @@ tcp_early_retrans - INTEGER
Default: 2
tcp_ecn - INTEGER
- Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
- used when both ends of the TCP flow support it. It is useful to
- avoid losses due to congestion (when the bottleneck router supports
- ECN).
+ Control use of Explicit Congestion Notification (ECN) by TCP.
+ ECN is used only when both ends of the TCP connection indicate
+ support for it. This feature is useful in avoiding losses due
+ to congestion by allowing supporting routers to signal
+ congestion before having to drop packets.
Possible values are:
- 0 disable ECN
- 1 ECN enabled
- 2 Only server-side ECN enabled. If the other end does
- not support ECN, behavior is like with ECN disabled.
+ 0 Disable ECN. Neither initiate nor accept ECN.
+ 1 Always request ECN on outgoing connection attempts.
+ 2 Enable ECN when requested by incomming connections
+ but do not request ECN on outgoing connections.
Default: 2
tcp_fack - BOOLEAN
@@ -215,15 +224,14 @@ tcp_fack - BOOLEAN
The value is not used, if tcp_sack is not enabled.
tcp_fin_timeout - INTEGER
- Time to hold socket in state FIN-WAIT-2, if it was closed
- by our side. Peer can be broken and never close its side,
- or even died unexpectedly. Default value is 60sec.
- Usual value used in 2.2 was 180 seconds, you may restore
- it, but remember that if your machine is even underloaded WEB server,
- you risk to overflow memory with kilotons of dead sockets,
- FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1,
- because they eat maximum 1.5K of memory, but they tend
- to live longer. Cf. tcp_max_orphans.
+ The length of time an orphaned (no longer referenced by any
+ application) connection will remain in the FIN_WAIT_2 state
+ before it is aborted at the local end. While a perfectly
+ valid "receive only" state for an un-orphaned connection, an
+ orphaned connection in FIN_WAIT_2 state could otherwise wait
+ forever for the remote to close its end of the connection.
+ Cf. tcp_max_orphans
+ Default: 60 seconds
tcp_frto - INTEGER
Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
@@ -1514,6 +1522,20 @@ cookie_preserve_enable - BOOLEAN
Default: 1
+cookie_hmac_alg - STRING
+ Select the hmac algorithm used when generating the cookie value sent by
+ a listening sctp socket to a connecting client in the INIT-ACK chunk.
+ Valid values are:
+ * md5
+ * sha1
+ * none
+ Ability to assign md5 or sha1 as the selected alg is predicated on the
+ configuarion of those algorithms at build time (CONFIG_CRYPTO_MD5 and
+ CONFIG_CRYPTO_SHA1).
+
+ Default: Dependent on configuration. MD5 if available, else SHA1 if
+ available, else none.
+
rcvbuf_policy - INTEGER
Determines if the receive buffer is attributed to the socket or to
association. SCTP supports the capability to create multiple
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 1c08a4b0981f..94444b152fbc 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -3,9 +3,9 @@
--------------------------------------------------------------------------------
This file documents the mmap() facility available with the PACKET
-socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that needs
-raw access to network interface.
+socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or any other that needs raw access to network interface.
You can find the latest version of this document at:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
@@ -21,19 +21,18 @@ Please send your comments to
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------
-In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call
-to capture each packet, it requires two if you want to get packet's
-timestamp (like libpcap always does).
+In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, it requires two if you want to get packet's timestamp
+(like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
-highest bandwidth.
-By using a shared buffer between the kernel and the user also has the benefit
-of minimizing packet copies.
+highest bandwidth. By using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI, also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network.
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
@@ -87,9 +87,7 @@ the following process:
socket creation and destruction is straight forward, and is done
the same way with or without PACKET_MMAP:
-int fd;
-
-fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
where mode is SOCK_RAW for the raw interface were link level
information can be captured or SOCK_DGRAM for the cooked
@@ -163,11 +161,23 @@ As capture, each frame contains two parts:
A complete tutorial is available at: http://wiki.gnu-log.net/
+By default, the user should put data at :
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at :
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
-
To setup PACKET_MMAP from user level code is done with a call like
- Capture process
@@ -201,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true
frames_per_block * tp_block_nr == tp_frame_nr
-
Lets see an example, with the following values:
tp_block_size= 4096
@@ -227,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
-
--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------
@@ -264,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and
The pagesize can also be determined dynamically with the getpagesize (2)
system call.
-
Block number limit
--------------------
@@ -284,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated.
v block #2
block #1
-
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator which is at the end the responsible for doing the allocation and
@@ -299,7 +305,6 @@ pointers to blocks is
131072/4 = 32768 blocks
-
PACKET_MMAP buffer size calculator
------------------------------------
@@ -340,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield
and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames
-
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, in the case of
an i386 kernel's memory size is limited to 1GiB.
@@ -372,7 +376,6 @@ the following (from include/linux/if_packet.h):
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- Pad to align to TPACKET_ALIGNMENT=16
*/
-
The following are conditions that are checked in packet_set_ring
@@ -413,7 +416,6 @@ and the following flags apply:
#define TP_STATUS_LOSING 4
#define TP_STATUS_CSUMNOTREADY 8
-
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
@@ -462,7 +464,6 @@ packets are in the ring:
It doesn't incur in a race condition to first check the status value and
then poll for frames.
-
++ Transmission process
Those defines are also used for transmission:
@@ -494,6 +495,196 @@ The user can also use poll() to check if a buffer is available:
retval = poll(&pfd, 1, timeout);
-------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID)
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - How to switch to TPACKET_V2:
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+ (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+ - RX Hash data available in user space
+ - Currently only RX_RING available
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT 18
+# define PACKET_FANOUT_HASH 0
+# define PACKET_FANOUT_LB 1
+#endif
+
+static int setup_socket(void)
+{
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return EXIT_FAILURE;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return EXIT_FAILURE;
+ }
+
+ return fd;
+}
+
+static void fanout_thread(void)
+{
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+}
+
+int main(int argc, char **argp)
+{
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+}
+
+-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
@@ -519,6 +710,13 @@ the networking stack is used (the behavior before this setting was added).
See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.
+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, thus you also
+ might want to have a look at Documentation/networking/filter.txt
+
--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index ef9ee71b4d7f..f9fa6db40a52 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -29,11 +29,9 @@ The kernel configuration option is STMMAC_ETH:
dma_txsize: DMA tx ring size;
buf_sz: DMA buffer size;
tc: control the HW FIFO threshold;
- tx_coe: Enable/Disable Tx Checksum Offload engine;
watchdog: transmit timeout (in milliseconds);
flow_ctrl: Flow control ability [on/off];
pause: Flow Control Pause Time;
- tmrate: timer period (only if timer optimisation is configured).
3) Command line options
Driver parameters can be also passed in command line by using:
@@ -60,17 +58,19 @@ Then the poll method will be scheduled at some future point.
The incoming packets are stored, by the DMA, in a list of pre-allocated socket
buffers in order to avoid the memcpy (Zero-copy).
-4.3) Timer-Driver Interrupt
-Instead of having the device that asynchronously notifies the frame receptions,
-the driver configures a timer to generate an interrupt at regular intervals.
-Based on the granularity of the timer, the frames that are received by the
-device will experience different levels of latency. Some NICs have dedicated
-timer device to perform this task. STMMAC can use either the RTC device or the
-TMU channel 2 on STLinux platforms.
-The timers frequency can be passed to the driver as parameter; when change it,
-take care of both hardware capability and network stability/performance impact.
-Several performance tests on STM platforms showed this optimisation allows to
-spare the CPU while having the maximum throughput.
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for the reception on chips older than the 3.50.
+New chips have an HW RX-Watchdog used for this mitigation.
+
+On Tx-side, the mitigation schema is based on a SW timer that calls the
+tx function (stmmac_tx) to reclaim the resource after transmitting the
+frames.
+Also there is another parameter (like a threshold) used to program
+the descriptors avoiding to set the interrupt on completion bit in
+when the frame is sent (xmit).
+
+Mitigation parameters can be tuned by ethtool.
4.4) WOL
Wake up on Lan feature through Magic and Unicast frames are supported for the
@@ -121,6 +121,7 @@ struct plat_stmmacenet_data {
int bugged_jumbo;
int pmt;
int force_sf_dma_mode;
+ int riwt_off;
void (*fix_mac_speed)(void *priv, unsigned int speed);
void (*bus_setup)(void __iomem *ioaddr);
int (*init)(struct platform_device *pdev);
@@ -156,6 +157,7 @@ Where:
o pmt: core has the embedded power module (optional).
o force_sf_dma_mode: force DMA to use the Store and Forward mode
instead of the Threshold.
+ o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode.
o fix_mac_speed: this callback is used for modifying some syscfg registers
(on ST SoCs) according to the link speed negotiated by the
physical layer .