| Age | Commit message (Collapse) | Author |
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-17
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).
The main changes are:
1) BPF infrastructure to migrate TCP child sockets from a listener to another
in the same reuseport group/map, from Kuniyuki Iwashima.
2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.
3) Streamline error reporting changes in libbpf as planned out in the
'libbpf: the road to v1.0' effort, from Andrii Nakryiko.
4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.
5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.
6) Support new LLVM relocations in libbpf to make them more linker friendly,
also add a doc to describe the BPF backend relocations, from Yonghong Song.
7) Silence long standing KUBSAN complaints on register-based shifts in
interpreter, from Daniel Borkmann and Eric Biggers.
8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
target arch cannot be determined, from Lorenz Bauer.
9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.
10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.
11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
programs to bpf_helpers.h header, from Florent Revest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Th_strings arrays netdev_features_strings, tunable_strings, and
phy_tunable_strings has been moved to file net/ethtool/common.c.
So fixes the comment.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make the gathered SMC statistics network namespace aware, for each
namespace collect an own set of statistic information.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support to collect more detailed SMC fallback reason statistics and
provide these statistics to user space on the netlink interface.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the netlink function which collects the statistics information and
delivers it to the userspace.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The round-robin rr_tx_counter was shared across CPUs leading to
significant cache thrashing at high packet rates. This patch switches
the round-robin packet counter to use a per-cpu variable to decide
the destination slave.
On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh
(-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after
this patch.
"perf top -e cache_misses" before:
12.31% [bonding] [k] bond_xmit_roundrobin_slave_get
10.59% [sch_fq_codel] [k] fq_codel_dequeue
9.34% [kernel] [k] skb_release_data
after:
15.42% [sch_fq_codel] [k] fq_codel_dequeue
10.06% [kernel] [k] __memset
9.12% [kernel] [k] skb_release_data
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.
Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().
Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to
select a new listener based on the child socket. migrating_sk varies
depending on if it is migrating a request in the accept queue or during
3WHS.
- accept_queue : sock (ESTABLISHED/SYN_RECV)
- 3WHS : request_sock (NEW_SYN_RECV)
In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.
- SK_PASS with selected_sk, select it as a new listener
- SK_PASS with selected_sk NULL, fallbacks to the random selection
- SK_DROP, cancel the migration.
There is a noteworthy point. We select a listening socket in three places,
but we do not have struct skb at closing a listener or retransmitting a
SYN+ACK. On the other hand, some helper functions do not expect skb is NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb
temporarily before running the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
|
|
We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing to select a new
listener.
We can currently get a unique ID of each listener in the userspace by
calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
This patch makes the pointer of sk available in sk_reuseport_md so that we
can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
|
|
reuseport_migrate_sock() does the same check done in
reuseport_listen_stop_sock(). If the reuseport group is capable of
migration, reuseport_migrate_sock() selects a new listener by the child
socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
fail in the migration, we have to decrement it later.
We will support migration by eBPF in the later commits.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
|
|
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.
The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.
Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.
This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.
By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.
When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:
- we cannot carry over the eBPF prog of shutdowned sockets
- we cannot attach/detach an eBPF prog to/from listening sockets via
shutdowned sockets
Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
|
|
As noted in the following commit, a closed listener has to hold the
reference to the reuseport group for socket migration. This patch adds a
field (num_closed_socks) to struct sock_reuseport to manage closed sockets
within the same reuseport group. Moreover, this and the following commits
introduce some helper functions to split socks[] into two sections and keep
TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended
queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE
sockets from the end.
TCP_LISTEN----------> <-------TCP_CLOSE
+---+---+ --- +---+ --- +---+ --- +---+
| 0 | 1 | ... | i | ... | j | ... | k |
+---+---+ --- +---+ --- +---+ --- +---+
i = num_socks - 1
j = max_socks - num_closed_socks
k = max_socks - 1
This patch also extends reuseport_add_sock() and reuseport_grow() to
support num_closed_socks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
|
|
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or eBPF program is attached, we will be able to migrate
child sockets from a listener to another in the same reuseport group after
close() or shutdown() syscalls.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
|
|
FW is now supporting more than 256 MSI-X per PF (up to 2K).
Hence, enlarge interrupt field in CREATE_EQ to make use of the new
MSI-X's.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The users of EQ are running their code on different CPUs and with
various affinity patterns. Move the cpumask setting close to their
actual usage.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Whenever query statistics is issued for trap, devlink subsystem
would also fill-in statistics 'dropped' field. This field indicates
the number of packets HW dropped and failed to report to the device driver,
and thus - to the devlink subsystem itself.
In case if device driver didn't register callback for hard drop
statistics querying, 'dropped' field will be omitted and not filled.
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch support for cable test for the ksz886x switches and the
ksz8081 PHY.
The patch was tested on a KSZ8873RLL switch with following results:
- port 1:
- provides invalid values, thus return -ENOTSUPP
(Errata: DS80000830A: "LinkMD does not work on Port 1",
http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8873-Errata-DS80000830A.pdf)
- port 2:
- can detect distance
- can detect open on each wire of pair A (wire 1 and 2)
- can detect open only on one wire of pair B (only wire 3)
- can detect short between wires of a pair (wires 1 + 2 or 3 + 6)
- short between pairs is detected as open.
For example short between wires 2 + 3 is detected as open.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support for MDI-X status and configuration
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some micrel devices share the same PHY register defines. This patch
moves them to one common header so other drivers can reuse them.
And reuse generic MII_* defines where possible.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The csum_value field in the rmnet_map_dl_csum_trailer structure is a
"real" Internet checksum. It is a 16 bit value, in big endian format,
which represents an inverted ones' complement sum over pairs of bytes.
Make that clear by changing its type to __sum16.
This makes a typecast in rmnet_map_ipv4_dl_csum_trailer() and
another in rmnet_map_ipv6_dl_csum_trailer() unnecessary.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support to create (and destroy) interfaces via a new
rtnetlink kind "wwan". The responsible driver has to use
the new wwan_register_ops() to make this possible.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In some cases, for example in the upcoming WWAN framework changes,
there's no natural "parent netdev", so sometimes dummy netdevs are
created or similar. IFLA_PARENT_DEV_NAME is a new attribute intended to
contain a device (sysfs, struct device) name that can be used instead
when creating a new netdev, if the rtnetlink family implements it.
As suggested by Parav Pandit, we also introduce IFLA_PARENT_DEV_BUS_NAME
attribute in order to uniquely identify a device on the system (with
bus/name pair).
ip-link(8) support for the generic parent device attributes will help
us avoid code duplication, so no other link type will require a custom
code to handle the parent name attribute. E.g. the WWAN interface
creation command will looks like this:
$ ip link add wwan0-1 parent-dev wwan0 type wwan channel-id 1
So, some future subsystem (or driver) FOO will have an interface
creation command that looks like this:
$ ip link add foo1-3 parent-dev foo1 type foo bar-id 3 baz-type Y
Below is an example of dumping link info of a random device with these
new attributes:
$ ip --details link show wlp0s20f3
4: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
state UP mode DORMANT group default qlen 1000
...
parent_bus pci parent_dev 0000:00:14.3
Co-developed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Co-developed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to make rtnetlink ops that can create different
kinds of devices, like what we want to add to the WWAN
framework, the priv_size and setup parameters aren't quite
sufficient. Make this easier to manage by allowing ops to
allocate their own netdev via an @alloc method that gets
the tb netlink data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add 25gbase-r phy interface mode
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The sja1105 hardware has a quirk in that some changes require a switch
reset, which loses all configuration. When the reset is initiated,
everything needs to be reprogrammed, including the MACs and the PCS.
This is currently done in sja1105_static_config_reload() - we manually
call sja1105_adjust_port_config(), sja1105_sgmii_pcs_config() and
sja1105_sgmii_pcs_force_speed() which are all internal functions.
There is a desire for sja1105 to use the common xpcs driver, and that
means that the equivalents of those functions, xpcs_do_config() and
xpcs_link_up() respectively, will no longer be local functions.
Forcing phylink to retrigger a resolve somehow, say by doing dev_close()
followed by dev_open() is not really an option, because the CPU port
might have a PCS as well, and there is no net device which we can close
and reopen for that. Additionally, the dev_close/dev_open sequence might
force a renegotiation of the copper-side link for SGMII ports connected
to a PHY, and this is undesirable as well, because the switch reset is
much quicker than a PHY autoneg, so we would have a lot more downtime.
The only solution I see is for the sja1105 driver to keep doing what
it's doing, and that means we need to export the equivalents from xpcs
for sja1105_sgmii_pcs_config and sja1105_sgmii_pcs_force_speed, and call
them directly in sja1105_static_config_reload(). This will be done
during the conversion patch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The NXP SJA1110 switch integrates its own, non-Synopsys PMA, but it
manages it through the register space of the XPCS itself, in a small
register window inside MDIO_MMD_VEND2 from address 0x8030 to 0x806e.
This coincides with where the registers for the default Synopsys PMA
are, but the register definitions are of course not the same.
This situation is an odd hardware quirk, but the simplest way to manage
it is to drive the SJA1110's PMA from within the XPCS driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The NXP SJA1105 DSA switch integrates a Synopsys SGMII XPCS on port 4.
The generic code works fine, except there is an integration issue which
needs to be dealt with: in this switch, the XPCS is integrated with a
PMA that has the TX lane polarity inverted by default (PLUS is MINUS,
MINUS is PLUS).
To obtain normal non-inverted behavior, the TX lane polarity must be
inverted in the PCS, via the DIGITAL_CONTROL_2 register.
We introduce a pma_config() method in xpcs_compat which is called by the
phylink_pcs_config() implementation.
Also, the NXP SJA1105 returns all zeroes in the PHY ID registers 2 and 3.
We need to hack up an ad-hoc PHY ID (OUI is zero, device ID is 1) in
order for the XPCS driver to recognize it. This PHY ID is added to the
public include/linux/pcs/pcs-xpcs.h for that reason (for the sja1105
driver to be able to use it in a later patch).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The struct mdio_xpcs_args is reminiscent of when a similarly named
struct mdio_xpcs_ops existed. Now that that is removed, we can shorten
the name to dw_xpcs (dw for DesignWare).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Jake Keller says:
====================
100GbE Intel Wired LAN Driver Updates 2021-06-11
Extend the ice driver to support basic PTP clock functionality for E810
devices.
This includes some tangential work required to setup the sideband queue and
driver shared parameters as well.
This series only supports E810-based devices. This is because other devices
based on the E822 MAC use a different and more complex PHY.
The low level device functionality is kept within ice_ptp_hw.c and is
designed to be extensible for supporting E822 devices in a future series.
This series also only supports very basic functionality including the
ptp_clock device and timestamping. Support for configuring periodic outputs
and external input timestamps will be implemented in a future series.
There are a couple of potential "what? why?" bits in this series I want to
point out:
1) the PTP hardware functionality is shared between multiple functions. This
means that the same clock registers are shared across multiple PFs. In order
to avoid contention or clashing between PFs, firmware assigns "ownership" to
one PF, while other PFs are merely "associated" with the timer. Because we
share the hardware resource, only the clock owner will allocate and register
a PTP clock device. Other PFs determine the appropriate PTP clock index to
report by using a firmware interface to read a shared parameter that is set
by the owning PF.
2) the ice driver uses its own kthread instead of using do_aux_work. This is
because the periodic and asynchronous tasks are necessary for all PFs, but
only one PF will allocate the clock.
The series is broken up into functional pieces to allow easy review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add SEQPACKET socket type to vsock trace event.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Small updates to make SOCK_SEQPACKET work:
1) Send SHUTDOWN on socket close for SEQPACKET type.
2) Set SEQPACKET packet type during send.
3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last
packet of message.
4) Implement data check function for SEQPACKET.
5) Check for max datagram size.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Callback fetches RW packets from rx queue of socket until whole record
is copied(if user's buffer is full, user is not woken up). This is done
to not stall sender, because if we wake up user and it leaves syscall,
nobody will send credit update for rest of record, and sender will wait
for next enter of read syscall at receiver's side. So if user buffer is
full, we just send credit update and drop data.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add set of defines and constants for SOCK_SEQPACKET support
in vsock.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add socket ops for SEQPACKET type and .seqpacket_allow() callback
to query transports if they support SEQPACKET. Also split path
for data check for STREAM and SEQPACKET branches.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update current stream enqueue function for SEQPACKET
support:
1) Call transport's seqpacket enqueue callback.
2) Return value from enqueue function is whole record length or error
for SOCK_SEQPACKET.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add receive loop for SEQPACKET. It looks like receive loop for
STREAM, but there are differences:
1) It doesn't call notify callbacks.
2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because
there is no sense for these values in SEQPACKET case.
3) It waits until whole record is received.
4) It processes and sets 'MSG_TRUNC' flag.
So to avoid extra conditions for two types of socket inside one loop, two
independent functions were created.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Define phylink_fwnode_phy_connect() to connect phy specified by
a fwnode to a phylink instance.
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Grant Likely <grant.likely@arm.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Define acpi_mdiobus_register() to Register mii_bus and create PHYs for
each ACPI child node.
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce a wrapper around the _ADR evaluation.
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce fwnode_mdiobus_register_phy() to register PHYs on the
mdiobus. From the compatible string, identify whether the PHY is
c45 and based on this create a PHY device instance which is
registered on the mdiobus.
Along with fwnode_mdiobus_register_phy() also introduce
fwnode_find_mii_timestamper() and fwnode_mdiobus_phy_device_register()
since they are needed.
While at it, also use the newly introduced fwnode operation in
of_mdiobus_phy_device_register().
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Extract phy_id from compatible string. This will be used by
fwnode_mdiobus_register_phy() to create phy device using the
phy_id.
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Define fwnode_phy_find_device() to iterate an mdiobus and find the
phy device of the provided phy fwnode. Additionally define
device_phy_find_device() to find phy device of provided device.
Define fwnode_get_phy_node() to get phy_node using named reference.
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Define fwnode_mdio_find_device() to get a pointer to the
mdio_device from fwnode passed to the function.
Refactor of_mdio_find_device() to use fwnode_mdio_find_device().
Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Grant Likely <grant.likely@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The TX timestamping procedure for SJA1105 is a bit unconventional
because the transmit procedure itself is unconventional.
Control packets (and therefore PTP as well) are transmitted to a
specific port in SJA1105 using "management routes" which must be written
over SPI to the switch. These are one-shot rules that match by
destination MAC address on traffic coming from the CPU port, and select
the precise destination port for that packet. So to transmit a packet
from NET_TX softirq context, we actually need to defer to a process
context so that we can perform that SPI write before we send the packet.
The DSA master dev_queue_xmit() runs in process context, and we poll
until the switch confirms it took the TX timestamp, then we annotate the
skb clone with that TX timestamp. This is why the sja1105 driver does
not need an skb queue for TX timestamping.
But the SJA1110 is a bit (not much!) more conventional, and you can
request 2-step TX timestamping through the DSA header, as well as give
the switch a cookie (timestamp ID) which it will give back to you when
it has the timestamp. So now we do need a queue for keeping the skb
clones until their TX timestamps become available.
The interesting part is that the metadata frames from SJA1105 haven't
disappeared completely. On SJA1105 they were used as follow-ups which
contained RX timestamps, but on SJA1110 they are actually TX completion
packets, which contain a variable (up to 32) array of timestamps.
Why an array? Because:
- not only is the TX timestamp on the egress port being communicated,
but also the RX timestamp on the CPU port. Nice, but we don't care
about that, so we ignore it.
- because a packet could be multicast to multiple egress ports, each
port takes its own timestamp, and the TX completion packet contains
the individual timestamps on each port.
This is unconventional because switches typically have a timestamping
FIFO and raise an interrupt, but this one doesn't. So the tagger needs
to detect and parse meta frames, and call into the main switch driver,
which pairs the timestamps with the skbs in the TX timestamping queue
which are waiting for one.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SJA1110 has improved a few things compared to SJA1105:
- To send a control packet from the host port with SJA1105, one needed
to program a one-shot "management route" over SPI. This is no longer
true with SJA1110, you can actually send "in-band control extensions"
in the packets sent by DSA, these are in fact DSA tags which contain
the destination port and switch ID.
- When receiving a control packet from the switch with SJA1105, the
source port and switch ID were written in bytes 3 and 4 of the
destination MAC address of the frame (which was a very poor shot at a
DSA header). If the control packet also had an RX timestamp, that
timestamp was sent in an actual follow-up packet, so there were
reordering concerns on multi-core/multi-queue DSA masters, where the
metadata frame with the RX timestamp might get processed before the
actual packet to which that timestamp belonged (there is no way to
pair a packet to its timestamp other than the order in which they were
received). On SJA1110, this is no longer true, control packets have
the source port, switch ID and timestamp all in the DSA tags.
- Timestamps from the switch were partial: to get a 64-bit timestamp as
required by PTP stacks, one would need to take the partial 24-bit or
32-bit timestamp from the packet, then read the current PTP time very
quickly, and then patch in the high bits of the current PTP time into
the captured partial timestamp, to reconstruct what the full 64-bit
timestamp must have been. That is awful because packet processing is
done in NAPI context, but reading the current PTP time is done over
SPI and therefore needs sleepable context.
But it also aggravated a few things:
- Not only is there a DSA header in SJA1110, but there is a DSA trailer
in fact, too. So DSA needs to be extended to support taggers which
have both a header and a trailer. Very unconventional - my understanding
is that the trailer exists because the timestamps couldn't be prepared
in time for putting them in the header area.
- Like SJA1105, not all packets sent to the CPU have the DSA tag added
to them, only control packets do:
* the ones which match the destination MAC filters/traps in
MAC_FLTRES1 and MAC_FLTRES0
* the ones which match FDB entries which have TRAP or TAKETS bits set
So we could in theory hack something up to request the switch to take
timestamps for all packets that reach the CPU, and those would be
DSA-tagged and contain the source port / switch ID by virtue of the
fact that there needs to be a timestamp trailer provided. BUT:
- The SJA1110 does not parse its own DSA tags in a way that is useful
for routing in cross-chip topologies, a la Marvell. And the sja1105
driver already supports cross-chip bridging from the SJA1105 days.
It does that by automatically setting up the DSA links as VLAN trunks
which contain all the necessary tag_8021q RX VLANs that must be
communicated between the switches that span the same bridge. So when
using tag_8021q on sja1105, it is possible to have 2 switches with
ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and
br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and
sw1p1, and forwarding will happen according to the expected rules of
the Linux bridge.
We like that, and we don't want that to go away, so as a matter of
fact, the SJA1110 tagger still needs to support tag_8021q.
So the sja1110 tagger is a hybrid between tag_8021q for data packets,
and the native hardware support for control packets.
On RX, packets have a 13-byte trailer if they contain an RX timestamp.
That trailer is padded in such a way that its byte 8 (the start of the
"residence time" field - not parsed by Linux because we don't care) is
aligned on a 16 byte boundary. So the padding has a variable length
between 0 and 15 bytes. The DSA header contains the offset of the
beginning of the padding relative to the beginning of the frame (and the
end of the padding is obviously the end of the packet minus 13 bytes,
the length of the trailer). So we discard it.
Packets which don't have a trailer contain the source port and switch ID
information in the header (they are "trap-to-host" packets). Packets
which have a trailer contain the source port and switch ID in the trailer.
On TX, the destination port mask and switch ID is always in the trailer,
so we always need to say in the header that a trailer is present.
The header needs a custom EtherType and this was chosen as 0xdadc, after
0xdada which is for Marvell and 0xdadb which is for VLANs in
VLAN-unaware mode on SJA1105 (and SJA1110 in fact too).
Because we use tag_8021q in concert with the native tagging protocol,
control packets will have 2 DSA tags.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In SJA1105, RX timestamps for packets sent to the CPU are transmitted in
separate follow-up packets (metadata frames). These contain partial
timestamps (24 or 32 bits) which are kept in SJA1105_SKB_CB(skb)->meta_tstamp.
Thankfully, SJA1110 improved that, and the RX timestamps are now
transmitted in-band with the actual packet, in the timestamp trailer.
The RX timestamps are now full-width 64 bits.
Because we process the RX DSA tags in the rcv() method in the tagger,
but we would like to preserve the DSA code structure in that we populate
the skb timestamp in the port_rxtstamp() call which only happens later,
the implication is that we must somehow pass the 64-bit timestamp from
the rcv() method all the way to port_rxtstamp(). We can use the skb->cb
for that.
Rename the meta_tstamp from struct sja1105_skb_cb from "meta_tstamp" to
"tstamp", and increase its size to 64 bits.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The added value of this function is that it can deal with both the case
where the VLAN header is in the skb head, as well as in the offload field.
This is something I was not able to do using other functions in the
network stack.
Since both ocelot-8021q and sja1105 need to do the same stuff, let's
make it a common service provided by tag_8021q.
This is done as refactoring for the new SJA1110 tagger, which partly
uses tag_8021q as well (just like SJA1105), and will be the third caller.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All users of tag_8021q select it in Kconfig, so shim functions are not
needed because it is not possible for it to be disabled and its callers
enabled.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some really really weird switches just couldn't decide whether to use a
normal or a tail tagger, so they just did both.
This creates problems for DSA, because we only have the concept of an
'overhead' which can be applied to the headroom or to the tailroom of
the skb (like for example during the central TX reallocation procedure),
depending on the value of bool tail_tag, but not to both.
We need to generalize DSA to cater for these odd switches by
transforming the 'overhead / tail_tag' pair into 'needed_headroom /
needed_tailroom'.
The DSA master's MTU is increased to account for both.
The flow dissector code is modified such that it only calls the DSA
adjustment callback if the tagger has a non-zero header length.
Taggers are trivially modified to declare either needed_headroom or
needed_tailroom, based on the tail_tag value that they currently
declare.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the ice_ptp_hw.c file and some associated definitions to the ice
driver folder. This file contains basic low level definitions for
functions that interact with the device hardware.
For now, only E810-based devices are supported. The ice hardware
supports 2 major variants which have different PHYs with different
procedures necessary for interacting with the device clock.
Because the device captures timestamps in the PHY, each PHY has its own
internal timer. The timers are synchronized in hardware by first
preparing the source timer and the PHY timer shadow registers, and then
issuing a synchronization command. This ensures that both the source
timer and PHY timers are programmed simultaneously. The timers
themselves are all driven from the same oscillator source.
The functions in ice_ptp_hw.c abstract over the differences between how
the PHYs in E810 are programmed vs how the PHYs in E822 devices are
programmed. This series only implements E810 support, but E822 support
will be added in a future change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-06-09
Introduce steering header insert/remove and switchdev bridge offloads
1) From Yevgeny, Steering header insert/remove support
ConnectX supports offloading of various encapsulations and decapsulations
(e.g. VXLAN), which are performed by 'Packet Reformat' action.
Starting with ConnectX-6 DX, a new reformat type is supported - INSERT_HEADER.
This reformat allows inserting an arbitrary size buffer at a selected location
in the packet on RX flows.
The insert/remove header support are needed as a prerequisite for the
bridge offloads vlan pop/push supprt, see below.
2) From Vlad, Support for bridge offloads for switchdev mode
This change implements bridge offloads with VLAN-support that works on top
of mlx5 representors in switchdev mode.
HIGH-LEVEL OVERVIEW
Hardware supported by mlx5 driver doesn't provide dynamic learning or aging
functionality and requires the driver to emulate all switch-like behavior
in software. As such, all packets by default go through miss path, appear
on representor and get to software bridge, if it is the upper device of the
representor. This causes bridge to process packet in software, learn the
MAC address to FDB and send SWITCHDEV_FDB_ADD_TO_DEVICE event to all
subscribers. Upon reception of SWITCHDEV_FDB_ADD_TO_DEVICE notification
mlx5 bridge offloads the FDB to hardware and sends back
SWITCHDEV_FDB_ADD_TO_BRIDGE notification to prevent such entries from being
aged out by kernel bridge. Leaving aging to kernel bridge would result
deletion of offloaded dynamic FDB entries every aging_time period due to
packets being processed by hardware and, consecutively, 'used' timestamp
for FDB entry not being updated. Hardware aging is emulated in driver by
running periodic workqueue task that manually updates the rules according
to their hardware counter:
- If hardware counter has changed since last update, the handler updates
'used' timestamp in kernel bridge dynamic entry by sending
SWITCHDEV_FDB_ADD_TO_BRIDGE notification for the entry.
- If FDB entry wasn't updated for user-controllable aging_time period,
then the FDB entry is unoffloaded from hardware and corresponding
SWITCHDEV_FDB_DEL_TO_BRIDGE notification is sent to kernel bridge.
The mlx5 bridge offload implementation fully supports port VLAN objects,
including PVID (vlan push) and "Egress Untagged" (vlan pop).
SOFTWARE ARCHITECTURE
Mlx5_eswitch is extended with pointer to new mlx5_esw_bridge_offloads
structure which has a linked list of mlx5_esw_bridge objects. Struct
mlx5_esw_bridge is the main switch object in mlx5 that holds all data for
offloaded FDB entries and metadata for bridge ports and their vlans. The
mlx5_esw_bridge object is created when first representor of eswitch vport
is added to bridge and deleted when the last representor is detached from
it. Bridge FDB entries are saved in linked list (to iterate over all FDB
entries in aging workqueue task) and also in hashtable for quick lookup by
MAC+VLAN tuple. Bridge FDB entries are saved in linked list (to iterate
over all FDB entries in aging workqueue task) and in hashtable for quick
lookup by MAC+VLAN tuple. Port metadata is stored in struct
mlx5_esw_bridge_port that is saved in xarray to allow quick lookup by vport
number. Part of the port metadata is the set of port vlans that are
represented by mlx5_esw_bridge_vlan structure. The vlan structure points to
all FDBs on vlan/port via fdb_list linked list.
Simplified diagram of mlx5 bridge objects:
+------------------+
| mxl5_eswitch |
| |
| br_offloads |
+--------+---------+
|
+--------v-------------------+
| mlx5_esw_bridge_offloads |
| |
+--> bridges |
| +-------+--------------------+
| |
| |
| +---v---------------+
| | mlx5_esw_bridge |
| | |
| | vports |
| | |
| | fdb_ht |
| +---+---------------+
| |
| +---v---------------+
+------+ mlx5_esw_bridge |
| |
+-------------------------+ vports |
| | |
| | fdb_ht +------------------------------------------+
| +-------------------+ |
| |
| |
| +----------------------+ +---------------------------+ |
+-> mlx5_esw_bridge_port | +--> mlx5_esw_bridge_fdb_entry <-+
| | | +----------------------+ | +--+------------------------+ |
| | vlans +--+-> mlx5_esw_bridge_vlan | | | |
| | | | | | | +--v------------------------+ |
| +----------------------+ | | fdb_list +--+ | mlx5_esw_bridge_fdb_entry <-+
| | +-------^--------------+ +--+------------------------+ |
| +----------------------+ | | | |
+-> mlx5_esw_bridge_port | | +-----------------------+ |
| | | |
| vlans | | -----------------------+ |
| | +-> mlx5_esw_bridge_vlan | |
+----------------------+ | | +---------------------------+ |
| fdb_list +-----> mlx5_esw_bridge_fdb_entry <-+
+-------^--------------+ +--+------------------------+
| |
+-----------------------+
HARDWARE REPRESENTATION
In order to adhere to kernel software datapath model bridge offloads must
come after TC and NF FDBs. However, since netfilter offload in mlx5 is
implemented with unmanaged tables, its miss path is not automatically
connected to next priority and requires the code to manually connect with
slow table. To keep bridge offloads encapsulated and not mix it with
eswitch offloads new FDB_TC_MISS priority is created between FDB_FT_OFFLOAD
and FDB_SLOW_PATH which allows bridge offloads to be created without
exposing its internal tables to any other modules since miss path of
managed TC-miss table is automatically wired to next priority.
The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB
namespace. The new priority is between tc-miss and slow path priorities.
Priority consist of two levels: the ingress table that is global per
eswitch and matches incoming packets by src_mac/vid and redirects them to
next level (egress table) that is chosen according to ingress port bridge
membership and matches on dst_mac/vid in order to redirect packet to vport
according to the following diagram:
+
|
+---------v----------+
| |
| FDB_TC_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_FT_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_TC_MISS |
| |
+---------+----------+
|
+--------------------------------------+
| | |
| +------+ |
| | |
| +------v--------+ FDB_BR_OFFLOAD |
| | INGRESS_TABLE | |
| +------+---+----+ |
| | | match |
| | +---------+ |
| | | | +-------+
| | +-------v-------+ match | | |
| | | EGRESS_TABLE +------------> vport |
| | +-------+-------+ | | |
| | | | +-------+
| | miss | |
| +------+------+ |
| | |
+--------------------------------------+
|
|
+---------v----------+
| |
| FDB_SLOW_PATH |
| |
+---------+----------+
|
v
PATCHES OVERVIEW
1-3 - Miscellaneous refactorings and infrastructure changes.
4 - Mlx5 bridge offload infrastructure and dedicated fs_core
namespace/tables implementation.
5 - FDB entry offload.
6 - Dynamic FDB entry aging.
7-10 - VLAN filtering offload.
11 - Tracepoints for main mlx5 bridge offload events (FDB entry
offload/unoffload, VLAN add/delete, etc.)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
--
|