summaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/6lowpan.txt4
-rw-r--r--Documentation/networking/af_xdp.rst312
-rw-r--r--Documentation/networking/bonding.txt2
-rw-r--r--Documentation/networking/e100.rst (renamed from Documentation/networking/e100.txt)60
-rw-r--r--Documentation/networking/e1000.rst (renamed from Documentation/networking/e1000.txt)59
-rw-r--r--Documentation/networking/failover.rst18
-rw-r--r--Documentation/networking/filter.txt21
-rw-r--r--Documentation/networking/gtp.txt4
-rw-r--r--Documentation/networking/ila.txt2
-rw-r--r--Documentation/networking/index.rst3
-rw-r--r--Documentation/networking/ip-sysctl.txt42
-rw-r--r--Documentation/networking/ipsec.txt4
-rw-r--r--Documentation/networking/ipvlan.txt4
-rw-r--r--Documentation/networking/kcm.txt10
-rw-r--r--Documentation/networking/net_failover.rst116
-rw-r--r--Documentation/networking/netdev-FAQ.txt9
-rw-r--r--Documentation/networking/netdev-features.txt7
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.txt2
-rw-r--r--Documentation/networking/ppp_generic.txt6
19 files changed, 570 insertions, 115 deletions
diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.txt
index a7dc7e939c7a9..2e5a939d7e6ff 100644
--- a/Documentation/networking/6lowpan.txt
+++ b/Documentation/networking/6lowpan.txt
@@ -24,10 +24,10 @@ enum lowpan_lltypes.
Example to evaluate the private usually you can do:
-static inline sturct lowpan_priv_foobar *
+static inline struct lowpan_priv_foobar *
lowpan_foobar_priv(struct net_device *dev)
{
- return (sturct lowpan_priv_foobar *)lowpan_priv(dev)->priv;
+ return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv;
}
switch (dev->type) {
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
new file mode 100644
index 0000000000000..ff929cfab4f49
--- /dev/null
+++ b/Documentation/networking/af_xdp.rst
@@ -0,0 +1,312 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+AF_XDP
+======
+
+Overview
+========
+
+AF_XDP is an address family that is optimized for high performance
+packet processing.
+
+This document assumes that the reader is familiar with BPF and XDP. If
+not, the Cilium project has an excellent reference guide at
+http://cilium.readthedocs.io/en/latest/bpf/.
+
+Using the XDP_REDIRECT action from an XDP program, the program can
+redirect ingress frames to other XDP enabled netdevs, using the
+bpf_redirect_map() function. AF_XDP sockets enable the possibility for
+XDP programs to redirect frames to a memory buffer in a user-space
+application.
+
+An AF_XDP socket (XSK) is created with the normal socket()
+syscall. Associated with each XSK are two rings: the RX ring and the
+TX ring. A socket can receive packets on the RX ring and it can send
+packets on the TX ring. These rings are registered and sized with the
+setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally sized chunks. A descriptor in
+one of the rings references a frame by referencing its addr. The addr
+is simply an offset within the entire UMEM region. The user space
+allocates memory for this UMEM using whatever means it feels is most
+appropriate (malloc, mmap, huge pages, etc). This memory area is then
+registered with the kernel using the new setsockopt XDP_UMEM_REG. The
+UMEM also has two rings: the FILL ring and the COMPLETION ring. The
+fill ring is used by the application to send down addr for the kernel
+to fill in with RX packet data. References to these frames will then
+appear in the RX ring once each packet has been received. The
+completion ring, on the other hand, contains frame addr that the
+kernel has transmitted completely and can now be used again by user
+space, for either TX or RX. Thus, the frame addrs appearing in the
+completion ring are addrs that were previously transmitted using the
+TX ring. In summary, the RX and FILL rings are used for the RX path
+and the TX and COMPLETION rings are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame addr references in its own RX ring that point to
+this shared UMEM. Note that since the ring structures are
+single-consumer / single-producer (for performance reasons), the new
+process has to create its own socket with associated RX and TX rings,
+since it cannot share this with the other process. This is also the
+reason that there is only one set of FILL and COMPLETION rings per
+UMEM. It is the responsibility of a single process to handle the UMEM.
+
+How is then packets distributed from an XDP program to the XSKs? There
+is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and ring number. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or XDP_SKB is explicitly chosen
+when loading the XDP program, XDP_SKB mode is employed that uses SKBs
+together with the generic XDP support and copies out the data to user
+space. A fallback mode that works for any network device. On the other
+hand, if the driver has support for XDP, it will be used by the AF_XDP
+code to provide better performance, but there is still a copy of the
+data into user space.
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be setup.
+
+Jonathan Corbet has also written an excellent article on LWN,
+"Accelerating networking with AF_XDP". It can be found at
+https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtual contiguous memory, divided into
+equal-sized frames. An UMEM is associated to a netdev and a specific
+queue id of that netdev. It is created and configured (chunk size,
+headroom, start address and size) by using the XDP_UMEM_REG setsockopt
+system call. A UMEM is bound to a netdev and queue id, via the bind()
+system call.
+
+An AF_XDP is socket linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share an UMEM created via one socket A,
+the next socket B can do this by setting the XDP_SHARED_UMEM flag in
+struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
+of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
+
+The UMEM has two single-producer/single-consumer rings, that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are a four different kind of rings: Fill, Completion, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application need explicit synchronization of multiple
+processes/threads are reading/writing to them.
+
+The UMEM uses two rings: Fill and Completion. Each socket associated
+with the UMEM must have an RX queue, TX queue or both. Say, that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one Fill ring, one Completion ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by struct xdp_ring
+producer member, and increasing the producer index. A consumer reads
+the data ring at the index pointed out by struct xdp_ring consumer
+member, and increasing the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of the rings need to be of size power of two.
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The Fill ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM addrs are passed in the ring. As
+an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
+16 chunks and can pass addrs between 0 and 64k.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM addrs to this ring. Note that the
+kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
+log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
+and 3000 refers to the same chunk.
+
+
+UMEM Completetion Ring
+~~~~~~~~~~~~~~~~~~~~~~
+
+The Completion Ring is used transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the Fill ring, UMEM indicies are
+used.
+
+Frames passed from the kernel to user-space are frames that has been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM addrs from this ring.
+
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains UMEM offset
+(addr) and the length of the data (len).
+
+If no frames have been passed to kernel via the Fill ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer a sendmsg() system call is required. This might
+be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+----------------------------
+
+On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
+is used in conjunction with bpf_redirect_map() to pass the ingress
+frame to a socket.
+
+The user application inserts the socket into the map, via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
+queue 17. Only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (samples/bpf/) in for an example.
+
+Usage
+=====
+
+In order to use AF_XDP sockets there are two parts needed. The
+user-space application and the XDP program. For a complete setup and
+usage example, please refer to the sample application. The user-space
+side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+
+Naive ring dequeue and enqueue could look like this::
+
+ // struct xdp_rxtx_ring {
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
+ // };
+
+ // struct xdp_umem_ring {
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
+ // };
+
+ // typedef struct xdp_rxtx_ring RING;
+ // typedef struct xdp_umem_ring RING;
+
+ // typedef struct xdp_desc RING_TYPE;
+ // typedef __u64 RING_TYPE;
+
+ int dequeue_one(RING *ring, RING_TYPE *item)
+ {
+ __u32 entries = *ring->producer - *ring->consumer;
+
+ if (entries == 0)
+ return -1;
+
+ // read-barrier!
+
+ *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
+ (*ring->consumer)++;
+ return 0;
+ }
+
+ int enqueue_one(RING *ring, const RING_TYPE *item)
+ {
+ u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
+
+ if (free_entries == 0)
+ return -1;
+
+ ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
+
+ // write-barrier!
+
+ (*ring->producer)++;
+ return 0;
+ }
+
+
+For a more optimized version, please refer to the sample application.
+
+Sample application
+==================
+
+There is a xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with both private and shared
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
+for this::
+
+ ethtool -N p3p2 rx-flow-hash udp4 fn
+ ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+ action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+ samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
+can be displayed with "-h", as usual.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
+
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 9ba04c0bab8db..c13214d073a48 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -140,7 +140,7 @@ bonding module at load time, or are specified via sysfs.
Module options may be given as command line arguments to the
insmod or modprobe command, but are usually specified in either the
-/etc/modrobe.d/*.conf configuration files, or in a distro-specific
+/etc/modprobe.d/*.conf configuration files, or in a distro-specific
configuration file (some of which are detailed in the next section).
Details on bonding support for sysfs is provided in the
diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.rst
index 54810b82c01af..d4d8370279254 100644
--- a/Documentation/networking/e100.txt
+++ b/Documentation/networking/e100.rst
@@ -1,7 +1,7 @@
Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters
==============================================================
-March 15, 2011
+June 1, 2018
Contents
========
@@ -36,16 +36,9 @@ Channel Bonding documentation can be found in the Linux kernel source:
Identifying Your Adapter
========================
-For more information on how to identify your adapter, go to the Adapter &
-Driver ID Guide at:
-
- http://support.intel.com/support/network/adapter/pro100/21397.htm
-
-For the latest Intel network drivers for Linux, refer to the following
-website. In the search field, enter your adapter name or type, or use the
-networking link on the left to search for your adapter:
-
- http://downloadfinder.intel.com/scripts-df/support_intel.asp
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+http://www.intel.com/support
Driver Configuration Parameters
===============================
@@ -57,22 +50,26 @@ Rx Descriptors: Number of receive descriptors. A receive descriptor is a data
structure that describes a receive buffer and its attributes to the network
controller. The data in the descriptor is used by the controller to write
data from the controller to host memory. In the 3.x.x driver the valid range
- for this parameter is 64-256. The default value is 64. This parameter can be
- changed using the command:
+ for this parameter is 64-256. The default value is 256. This parameter can be
+ changed using the command::
- ethtool -G eth? rx n, where n is the number of desired rx descriptors.
+ ethtool -G eth? rx n
+
+ Where n is the number of desired Rx descriptors.
Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data
structure that describes a transmit buffer and its attributes to the network
controller. The data in the descriptor is used by the controller to read
data from the host memory to the controller. In the 3.x.x driver the valid
- range for this parameter is 64-256. The default value is 64. This parameter
- can be changed using the command:
+ range for this parameter is 64-256. The default value is 128. This parameter
+ can be changed using the command::
+
+ ethtool -G eth? tx n
- ethtool -G eth? tx n, where n is the number of desired tx descriptors.
+ Where n is the number of desired Tx descriptors.
Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by
- default. The ethtool utility can be used as follows to force speed/duplex.
+ default. The ethtool utility can be used as follows to force speed/duplex.::
ethtool -s eth? autoneg off speed {10|100} duplex {full|half}
@@ -81,7 +78,7 @@ Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by
Event Log Message Level: The driver uses the message level flag to log events
to syslog. The message level can be set at driver load time. It can also be
- set using the command:
+ set using the command::
ethtool -s eth? msglvl n
@@ -112,9 +109,9 @@ Additional Configurations
---------------------
In order to see link messages and other Intel driver information on your
console, you must set the dmesg level up to six. This can be done by
- entering the following on the command line before loading the e100 driver:
+ entering the following on the command line before loading the e100 driver::
- dmesg -n 8
+ dmesg -n 6
If you wish to see all messages issued by the driver, including debug
messages, set the dmesg level to eight.
@@ -146,7 +143,8 @@ Additional Configurations
NAPI (Rx polling mode) is supported in the e100 driver.
- See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
+ See https://wiki.linuxfoundation.org/networking/napi for more information
+ on NAPI.
Multiple Interfaces on Same Ethernet Broadcast Network
------------------------------------------------------
@@ -160,7 +158,7 @@ Additional Configurations
If you have multiple interfaces in a server, either turn on ARP
filtering by
- (1) entering: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+ (1) entering:: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
(this only works if your kernel's version is higher than 2.4.5), or
(2) installing the interfaces in separate broadcast domains (either
@@ -169,15 +167,11 @@ Additional Configurations
Support
=======
-
For general information, go to the Intel support website at:
+http://www.intel.com/support/
- http://support.intel.com
-
- or the Intel Wired Networking project hosted by Sourceforge at:
-
- http://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on the supported
-kernel with a supported adapter, email the specific information related to the
-issue to e1000-devel@lists.sourceforge.net.
+or the Intel Wired Networking project hosted by Sourceforge at:
+http://sourceforge.net/projects/e1000
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to e1000-devel@lists.sf.net.
diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.rst
index 1f6ed848363dc..616848940e63f 100644
--- a/Documentation/networking/e1000.txt
+++ b/Documentation/networking/e1000.rst
@@ -154,7 +154,7 @@ NOTE: When e1000 is loaded with default settings and multiple adapters
are in use simultaneously, the CPU utilization may increase non-
linearly. In order to limit the CPU utilization without impacting
the overall throughput, we recommend that you load the driver as
- follows:
+ follows::
modprobe e1000 InterruptThrottleRate=3000,3000,3000
@@ -167,8 +167,8 @@ NOTE: When e1000 is loaded with default settings and multiple adapters
RxDescriptors
-------------
-Valid Range: 80-256 for 82542 and 82543-based adapters
- 80-4096 for all other supported adapters
+Valid Range: 48-256 for 82542 and 82543-based adapters
+ 48-4096 for all other supported adapters
Default Value: 256
This value specifies the number of receive buffer descriptors allocated
@@ -230,8 +230,8 @@ speed. Duplex should also be set when Speed is set to either 10 or 100.
TxDescriptors
-------------
-Valid Range: 80-256 for 82542 and 82543-based adapters
- 80-4096 for all other supported adapters
+Valid Range: 48-256 for 82542 and 82543-based adapters
+ 48-4096 for all other supported adapters
Default Value: 256
This value is the number of transmit descriptors allocated by the driver.
@@ -242,41 +242,10 @@ NOTE: Depending on the available system resources, the request for a
higher number of transmit descriptors may be denied. In this case,
use a lower number.
-TxDescriptorStep
-----------------
-Valid Range: 1 (use every Tx Descriptor)
- 4 (use every 4th Tx Descriptor)
-
-Default Value: 1 (use every Tx Descriptor)
-
-On certain non-Intel architectures, it has been observed that intense TX
-traffic bursts of short packets may result in an improper descriptor
-writeback. If this occurs, the driver will report a "TX Timeout" and reset
-the adapter, after which the transmit flow will restart, though data may
-have stalled for as much as 10 seconds before it resumes.
-
-The improper writeback does not occur on the first descriptor in a system
-memory cache-line, which is typically 32 bytes, or 4 descriptors long.
-
-Setting TxDescriptorStep to a value of 4 will ensure that all TX descriptors
-are aligned to the start of a system memory cache line, and so this problem
-will not occur.
-
-NOTES: Setting TxDescriptorStep to 4 effectively reduces the number of
- TxDescriptors available for transmits to 1/4 of the normal allocation.
- This has a possible negative performance impact, which may be
- compensated for by allocating more descriptors using the TxDescriptors
- module parameter.
-
- There are other conditions which may result in "TX Timeout", which will
- not be resolved by the use of the TxDescriptorStep parameter. As the
- issue addressed by this parameter has never been observed on Intel
- Architecture platforms, it should not be used on Intel platforms.
-
TxIntDelay
----------
Valid Range: 0-65535 (0=off)
-Default Value: 64
+Default Value: 8
This value delays the generation of transmit interrupts in units of
1.024 microseconds. Transmit interrupt reduction can improve CPU
@@ -288,7 +257,7 @@ TxAbsIntDelay
-------------
(This parameter is supported only on 82540, 82545 and later adapters.)
Valid Range: 0-65535 (0=off)
-Default Value: 64
+Default Value: 32
This value, in units of 1.024 microseconds, limits the delay in which a
transmit interrupt is generated. Useful only if TxIntDelay is non-zero,
@@ -310,7 +279,7 @@ Copybreak
---------
Valid Range: 0-xxxxxxx (0=off)
Default Value: 256
-Usage: insmod e1000.ko copybreak=128
+Usage: modprobe e1000.ko copybreak=128
Driver copies all packets below or equaling this size to a fresh RX
buffer before handing it up the stack.
@@ -328,14 +297,6 @@ Default Value: 0 (disabled)
Allows PHY to turn off in lower power states. The user can turn off
this parameter in supported chipsets.
-KumeranLockLoss
----------------
-Valid Range: 0-1
-Default Value: 1 (enabled)
-
-This workaround skips resetting the PHY at shutdown for the initial
-silicon releases of ICH8 systems.
-
Speed and Duplex Configuration
==============================
@@ -397,12 +358,12 @@ Additional Configurations
------------
Jumbo Frames support is enabled by changing the MTU to a value larger than
the default of 1500. Use the ifconfig command to increase the MTU size.
- For example:
+ For example::
ifconfig eth<x> mtu 9000 up
This setting is not saved across reboots. It can be made permanent if
- you add:
+ you add::
MTU=9000
diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst
new file mode 100644
index 0000000000000..f0c8483cdbf59
--- /dev/null
+++ b/Documentation/networking/failover.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+FAILOVER
+========
+
+Overview
+========
+
+The failover module provides a generic interface for paravirtual drivers
+to register a netdev and a set of ops with a failover instance. The ops
+are used as event handlers that get called to handle netdev register/
+unregister/link change/name change events on slave pci ethernet devices
+with the same mac address as the failover netdev.
+
+This enables paravirtual drivers to use a VF as an accelerated low latency
+datapath. It also allows live migration of VMs with direct attached VFs by
+failing over to the paravirtual datapath when the VF is unplugged.
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index fd55c7de99910..e6b4ebb2b2438 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -483,6 +483,12 @@ Example output from dmesg:
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
+When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
+setting any other value than that will return in failure. This is even the case for
+setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
+is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
+generally recommended approach instead.
+
In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump:
@@ -1136,6 +1142,7 @@ into a register from memory, the register's top 56 bits are known zero, while
the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
0x1ff), because of potential carries.
+
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
@@ -1144,14 +1151,16 @@ BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
+
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
-checks: after adding some variable to a packet pointer, if you then copy it to
-another register and (say) add a constant 4, both registers will share the same
-'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
-found to be less than a PTR_TO_PACKET_END, the other register is now known to
-have a safe range of at least 4 bytes. See 'Direct packet access', below, for
-more on PTR_TO_PACKET ranges.
+checks: after adding a variable to a packet pointer register A, if you then copy
+it to another register B and then add a constant 4 to A, both registers will
+share the same 'id' but the A will have a fixed offset of +4. Then if A is
+bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
+now known to have a safe range of at least 4 bytes. See 'Direct packet access',
+below, for more on PTR_TO_PACKET ranges.
+
The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt
index 0d9c18f05ec61..6966bbec1ecb7 100644
--- a/Documentation/networking/gtp.txt
+++ b/Documentation/networking/gtp.txt
@@ -67,7 +67,7 @@ Don't be confused by terminology: The GTP User Plane goes through
kernel accelerated path, while the GTP Control Plane goes to
Userspace :)
-The official homepge of the module is at
+The official homepage of the module is at
https://osmocom.org/projects/linux-kernel-gtp-u/wiki
== Userspace Programs with Linux Kernel GTP-U support ==
@@ -120,7 +120,7 @@ If yo have questions regarding how to use the Kernel GTP module from
your own software, or want to contribute to the code, please use the
osmocom-net-grps mailing list for related discussion. The list can be
reached at osmocom-net-gprs@lists.osmocom.org and the mailman
-interface for managign your subscription is at
+interface for managing your subscription is at
https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs
== Issue Tracker ==
diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt
index 78df879abd263..a17dac9dc915d 100644
--- a/Documentation/networking/ila.txt
+++ b/Documentation/networking/ila.txt
@@ -121,7 +121,7 @@ three options to deal with this:
- checksum neutral mapping
When an address is translated the difference can be offset
- elsewhere in a part of the packet that is covered by the
+ elsewhere in a part of the packet that is covered by
the checksum. The low order sixteen bits of the identifier
are used. This method is preferred since it doesn't require
parsing a packet beyond the IP header and in most cases the
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index f204eaff657d8..fec8588a588ee 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,9 +6,12 @@ Contents:
.. toctree::
:maxdepth: 2
+ af_xdp
batman-adv
can
dpaa2/index
+ e100
+ e1000
kapi
z8530book
msg_zerocopy
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 35ffaa281b269..ce8fbf5aa63ca 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -26,7 +26,7 @@ ip_no_pmtu_disc - INTEGER
discarded. Outgoing frames are handled the same as in mode 1,
implicitly setting IP_PMTUDISC_DONT on every created socket.
- Mode 3 is a hardend pmtu discover mode. The kernel will only
+ Mode 3 is a hardened pmtu discover mode. The kernel will only
accept fragmentation-needed errors if the underlying protocol
can verify them besides a plain socket lookup. Current
protocols for which pmtu events will be honored are TCP, SCTP
@@ -449,8 +449,10 @@ tcp_recovery - INTEGER
features.
RACK: 0x1 enables the RACK loss detection for fast detection of lost
- retransmissions and tail drops.
+ retransmissions and tail drops. It also subsumes and disables
+ RFC6675 recovery for SACK connections.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+ RACK: 0x4 disables RACK's DUPACK threshold heuristic
Default: 0x1
@@ -523,6 +525,19 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
tcp_sack - BOOLEAN
Enable select acknowledgments (SACKS).
+tcp_comp_sack_delay_ns - LONG INTEGER
+ TCP tries to reduce number of SACK sent, using a timer
+ based on 5% of SRTT, capped by this sysctl, in nano seconds.
+ The default is 1ms, based on TSO autosizing period.
+
+ Default : 1,000,000 ns (1 ms)
+
+tcp_comp_sack_nr - INTEGER
+ Max numer of SACK that can be compressed.
+ Using 0 disables SACK compression.
+
+ Detault : 44
+
tcp_slow_start_after_idle - BOOLEAN
If set, provide RFC2861 behavior and time out the congestion
window after an idle period. An idle period is defined at
@@ -652,11 +667,15 @@ tcp_tso_win_divisor - INTEGER
building larger TSO frames.
Default: 3
-tcp_tw_reuse - BOOLEAN
- Allow to reuse TIME-WAIT sockets for new connections when it is
- safe from protocol viewpoint. Default value is 0.
+tcp_tw_reuse - INTEGER
+ Enable reuse of TIME-WAIT sockets for new connections when it is
+ safe from protocol viewpoint.
+ 0 - disable
+ 1 - global enable
+ 2 - enable for loopback traffic only
It should not be changed without advice/request of technical
experts.
+ Default: 2
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
@@ -1428,6 +1447,19 @@ ip6frag_low_thresh - INTEGER
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
+IPv6 Segment Routing:
+
+seg6_flowlabel - INTEGER
+ Controls the behaviour of computing the flowlabel of outer
+ IPv6 header in case of SR T.encaps
+
+ -1 set flowlabel to zero.
+ 0 copy flowlabel from Inner packet in case of Inner IPv6
+ (Set flowlabel to 0 in case IPv4/L2)
+ 1 Compute the flowlabel using seg6_make_flowlabel()
+
+ Default is 0.
+
conf/default/*:
Change the interface-specific default settings.
diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt
index 8dbc08b7e431d..ba794b7e51be8 100644
--- a/Documentation/networking/ipsec.txt
+++ b/Documentation/networking/ipsec.txt
@@ -25,8 +25,8 @@ Quote from RFC3173:
is implementation dependent.
Current IPComp implementation is indeed by the book, while as in practice
-when sending non-compressed packet to the peer(whether or not packet len
-is smaller than the threshold or the compressed len is large than original
+when sending non-compressed packet to the peer (whether or not packet len
+is smaller than the threshold or the compressed len is larger than original
packet len), the packet is dropped when checking the policy as this packet
matches the selector but not coming from any XFRM layer, i.e., with no
security path. Such naked packet will not eventually make it to upper layer.
diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
index 812ef003e0a86..27a38e50c2874 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -73,11 +73,11 @@ mode to make conn-tracking work.
This is the default option. To configure the IPvlan port in this mode,
user can choose to either add this option on the command-line or don't specify
anything. This is the traditional mode where slaves can cross-talk among
-themseleves apart from talking through the master device.
+themselves apart from talking through the master device.
5.2 private:
If this option is added to the command-line, the port is set in private
-mode. i.e. port wont allow cross communication between slaves.
+mode. i.e. port won't allow cross communication between slaves.
5.3 vepa:
If this is added to the command-line, the port is set in VEPA mode.
diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt
index 9a513295b07c7..b773a5278ac41 100644
--- a/Documentation/networking/kcm.txt
+++ b/Documentation/networking/kcm.txt
@@ -1,4 +1,4 @@
-Kernel Connection Mulitplexor
+Kernel Connection Multiplexor
-----------------------------
Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
@@ -31,7 +31,7 @@ KCM implements an NxM multiplexor in the kernel as diagrammed below:
KCM sockets
-----------
-The KCM sockets provide the user interface to the muliplexor. All the KCM sockets
+The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.
@@ -199,7 +199,7 @@ while. Example use:
BFP programs for message delineation
------------------------------------
-BPF programs can be compiled using the BPF LLVM backend. For exmple,
+BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is:
#include "bpf.h" /* for __sk_buff */
@@ -222,7 +222,7 @@ messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also make
application layer messages a unit of work in the kernel for the purposes of
-steerng and scheduling, which in turn allows a simpler networking model in
+steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.
Configurations
@@ -272,7 +272,7 @@ on the socket thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
-may have occurred in the middle of receiving a messssge).
+may have occurred in the middle of receiving a message).
TCP connection monitoring
-------------------------
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
new file mode 100644
index 0000000000000..70ca2f5800c43
--- /dev/null
+++ b/Documentation/networking/net_failover.rst
@@ -0,0 +1,116 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+NET_FAILOVER
+============
+
+Overview
+========
+
+The net_failover driver provides an automated failover mechanism via APIs
+to create and destroy a failover master netdev and mananges a primary and
+standby slave netdevs that get registered via the generic failover
+infrastructrure.
+
+The failover netdev acts a master device and controls 2 slave devices. The
+original paravirtual interface is registered as 'standby' slave netdev and
+a passthru/vf device with the same MAC gets registered as 'primary' slave
+netdev. Both 'standby' and 'failover' netdevs are associated with the same
+'pci' device. The user accesses the network interface via 'failover' netdev.
+The 'failover' netdev chooses 'primary' netdev as default for transmits when
+it is available with link up and running.
+
+This can be used by paravirtual drivers to enable an alternate low latency
+datapath. It also enables hypervisor controlled live migration of a VM with
+direct attached VF by failing over to the paravirtual datapath when the VF
+is unplugged.
+
+virtio-net accelerated datapath: STANDBY mode
+=============================================
+
+net_failover enables hypervisor controlled accelerated datapath to virtio-net
+enabled VMs in a transparent manner with no/minimal guest userspace chanages.
+
+To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY
+feature on the virtio-net interface and assign the same MAC address to both
+virtio-net and VF interfaces.
+
+Here is an example XML snippet that shows such configuration.
+
+ <interface type='network'>
+ <mac address='52:54:00:00:12:53'/>
+ <source network='enp66s0f0_br'/>
+ <target dev='tap01'/>
+ <model type='virtio'/>
+ <driver name='vhost' queues='4'/>
+ <link state='down'/>
+ <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
+ </interface>
+ <interface type='hostdev' managed='yes'>
+ <mac address='52:54:00:00:12:53'/>
+ <source>
+ <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+ </source>
+ <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+ </interface>
+
+Booting a VM with the above configuration will result in the following 3
+netdevs created in the VM.
+
+4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+ inet 192.168.12.53/24 brd 192.168.12.255 scope global dynamic ens10
+ valid_lft 42482sec preferred_lft 42482sec
+ inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
+ valid_lft forever preferred_lft forever
+5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+
+ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave
+'standby' and 'primary' netdevs respectively.
+
+Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
+==================================================================
+
+net_failover also enables hypervisor controlled live migration to be supported
+with VMs that have direct attached SR-IOV VF devices by automatic failover to
+the paravirtual datapath when the VF is unplugged.
+
+Here is a sample script that shows the steps to initiate live migration on
+the source hypervisor.
+
+# cat vf_xml
+<interface type='hostdev' managed='yes'>
+ <mac address='52:54:00:00:12:53'/>
+ <source>
+ <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+ </source>
+ <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+</interface>
+
+# Source Hypervisor
+#!/bin/bash
+
+DOMAIN=fedora27-tap01
+PF=enp66s0f0
+VF_NUM=5
+TAP_IF=tap01
+VF_XML=
+
+MAC=52:54:00:00:12:53
+ZERO_MAC=00:00:00:00:00:00
+
+virsh domif-setlink $DOMAIN $TAP_IF up
+bridge fdb del $MAC dev $PF master
+virsh detach-device $DOMAIN $VF_XML
+ip link set $PF vf $VF_NUM mac $ZERO_MAC
+
+virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+# Destination Hypervisor
+#!/bin/bash
+
+virsh attach-device $DOMAIN $VF_XML
+virsh domif-setlink $DOMAIN $TAP_IF down
diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
index 2a3278d5cf353..fa951b820b25d 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -179,6 +179,15 @@ A: No. See above answer. In short, if you think it really belongs in
dash marker line as described in Documentation/process/submitting-patches.rst to
temporarily embed that information into the patch that you send.
+Q: Are all networking bug fixes backported to all stable releases?
+
+A: Due to capacity, Dave could only take care of the backports for the last
+ 2 stable releases. For earlier stable releases, each stable branch maintainer
+ is supposed to take care of them. If you find any patch is missing from an
+ earlier stable branch, please notify stable@vger.kernel.org with either a
+ commit ID or a formal patch backported, and CC Dave and other relevant
+ networking developers.
+
Q: Someone said that the comment style and coding convention is different
for the networking content. Is this true?
diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt
index c77f9d57eb915..c4a54c162547d 100644
--- a/Documentation/networking/netdev-features.txt
+++ b/Documentation/networking/netdev-features.txt
@@ -113,6 +113,13 @@ whatever headers there might be.
NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit
set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6).
+ * Transmit UDP segmentation offload
+
+NETIF_F_GSO_UDP_GSO_L4 accepts a single UDP header with a payload that exceeds
+gso_size. On segmentation, it segments the payload on gso_size boundaries and
+replicates the network and UDP headers (fixing up the last one if less than
+gso_size).
+
* Transmit DMA from high memory
On platforms where this is relevant, NETIF_F_HIGHDMA signals that
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt
index 433b6724797ad..1669dc2419fdc 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.txt
@@ -156,7 +156,7 @@ nf_conntrack_timestamp - BOOLEAN
nf_conntrack_udp_timeout - INTEGER (seconds)
default 30
-nf_conntrack_udp_timeout_stream2 - INTEGER (seconds)
+nf_conntrack_udp_timeout_stream - INTEGER (seconds)
default 180
This extended timeout will be used in case there is an UDP stream
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt
index 091d20273dcbb..61daf4b396009 100644
--- a/Documentation/networking/ppp_generic.txt
+++ b/Documentation/networking/ppp_generic.txt
@@ -300,12 +300,6 @@ unattached instance are:
The ioctl calls available on an instance of /dev/ppp attached to a
channel are:
-* PPPIOCDETACH detaches the instance from the channel. This ioctl is
- deprecated since the same effect can be achieved by closing the
- instance. In order to prevent possible races this ioctl will fail
- with an EINVAL error if more than one file descriptor refers to this
- instance (i.e. as a result of dup(), dup2() or fork()).
-
* PPPIOCCONNECT connects this channel to a PPP interface. The
argument should point to an int containing the interface unit
number. It will return an EINVAL error if the channel is already