summaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)Author
2021-10-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS for net-next: 1) Add new run_estimation toggle to IPVS to stop the estimation_timer logic, from Dust Li. 2) Relax superfluous dynset check on NFT_SET_TIMEOUT. 3) Add egress hook, from Lukas Wunner. 4) Nowadays, almost all hook functions in x_table land just call the hook evaluation loop. Remove remaining hook wrappers from iptables and IPVS. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18ether: add EtherType for proprietary Realtek protocolsAlvin Šipraga
Add a new EtherType ETH_P_REALTEK to the if_ether.h uapi header. The EtherType 0x8899 is used in a number of different protocols from Realtek Semiconductor Corp [1], so no general assumptions should be made when trying to decode such packets. Observed protocols include: 0x1 - Realtek Remote Control protocol [2] 0x2 - Echo protocol [2] 0x3 - Loop detection protocol [2] 0x4 - RTL8365MB 4- and 8-byte switch CPU tag protocols [3] 0x9 - RTL8306 switch CPU tag protocol [4] 0xA - RTL8366RB switch CPU tag protocol [4] [1] https://lore.kernel.org/netdev/CACRpkdYQthFgjwVzHyK3DeYUOdcYyWmdjDPG=Rf9B3VrJ12Rzg@mail.gmail.com/ [2] https://www.wireshark.org/lists/ethereal-dev/200409/msg00090.html [3] https://lore.kernel.org/netdev/20210822193145.1312668-4-alvin@pqrs.dk/ [4] https://lore.kernel.org/netdev/20200708122537.1341307-2-linus.walleij@linaro.org/ Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add netlink support for SMC-Rv2Karsten Graul
Implement the netlink support for SMC-Rv2 related attributes that are provided to user space. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15fq_codel: implement L4S style ce_threshold_ect1 markingEric Dumazet
Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency, Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold. If enabled, only packets with ECT(1) can be transformed to CE if their sojourn time is above the ce_threshold. Note that this new option does not change rules for codel law. In particular, if TCA_FQ_CODEL_ECN is left enabled (this is the default when fq_codel qdisc is created), ECT(0) packets can still get CE if codel law (as governed by limit/target) decides so. Section 4.3.b of current draft [1] states: b. A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can be used for L4S. For instance within each queue of an FQ-CoDel system, as well as a CoDel AQM, there is typically also ECN marking at an immediate (unsmoothed) shallow threshold to support use in data centres (see Sec.5.2.7 of [RFC8290]). This can be modified so that the shallow threshold is solely applied to ECT(1) packets. Then if there is a flow of non-ECN or ECT(0) packets in the per-flow-queue, the Classic AQM (e.g. CoDel) is applied; while if there is a flow of ECT(1) packets in the queue, the shallower (typically sub-millisecond) threshold is applied. Tested: tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec netperf ... -t TCP_STREAM -- K dctcp tc -s -d qd sh dev eth1 qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013) backlog 0b 0p requeues 152013 maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639 new_flows_len 0 old_flows_len 0 [1] L4S current draft: https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com> Cc: Tom Henderson <tomh@tomh.org> Cc: Bob Briscoe <in@bobbriscoe.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-14netfilter: Introduce egress hookLukas Wunner
Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-12net, neigh: Add NTF_MANAGED flag for managed neighbor entriesDaniel Borkmann
Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED flag. The flag then indicates to the kernel that the neighbor entry should be periodically probed for keeping the entry in NUD_REACHABLE state iff possible. The use case for this is targeting XDP or tc BPF load-balancers which use the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution for their backends. Given they cannot be resolved in fast-path, a control plane inserts the L3 (without L2) entries manually into the neighbor table and lets the kernel do the neighbor resolution either on the gateway or on the backend directly in case the latter resides in the same L2. This avoids to deal with L2 in the control plane and to rebuild what the kernel already does best anyway. NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor table which gets triggered by the system work queue to periodically call neigh_event_send() for performing the resolution. The implementation allows migration from/to NTF_MANAGED neighbor entries, so that already existing entries can be converted by the control plane if needed. Potentially, we could make the interval for periodically calling neigh_event_send() configurable; right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which has similar driver-internal infrastructure c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table"). In future, the latter could possibly reuse the NTF_MANAGED neighbors as well. Example: # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Link: https://linuxplumbersconf.org/event/11/contributions/953/ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12net, neigh: Extend neigh->flags to 32 bit to allow for extensionsRoopa Prabhu
Currently, all bits in struct ndmsg's ndm_flags are used up with the most recent addition of 435f2e7cc0b7 ("net: bridge: add support for sticky fdb entries"). This makes it impossible to extend the neighboring subsystem with new NTF_* flags: struct ndmsg { __u8 ndm_family; __u8 ndm_pad1; __u16 ndm_pad2; __s32 ndm_ifindex; __u16 ndm_state; __u8 ndm_flags; __u8 ndm_type; }; There are ndm_pad{1,2} attributes which are not used. However, due to uncareful design, the kernel does not enforce them to be zero upon new neighbor entry addition, and given they've been around forever, it is not possible to reuse them today due to risk of breakage. One option to overcome this limitation is to add a new NDA_FLAGS_EXT attribute for extended flags. In struct neighbour, there is a 3 byte hole between protocol and ha_lock, which allows neigh->flags to be extended from 8 to 32 bits while still being on the same cacheline as before. This also allows for all future NTF_* flags being in neigh->flags rather than yet another flags field. Unknown flags in NDA_FLAGS_EXT will be rejected by the kernel. Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08vsock: Enable y2038 safe timeval for timeoutRichard Palethorpe
Reuse the timeval compat code from core/sock to handle 32-bit and 64-bit timeval structures. Also introduce a new socket option define to allow using y2038 safe timeval under 32-bit. The existing behavior of sock_set_timeout and vsock's timeout setter differ when the time value is out of bounds. vsocks current behavior is retained at the expense of not being able to share the full implementation. This allows the LTP test vsock01 to pass under 32-bit compat mode. Fixes: fe0c72f3db11 ("socket: move compat timeout handling into sock.c") Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com> Cc: Richard Palethorpe <rpalethorpe@richiejp.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07Merge tag 'net-5.15-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from xfrm, bpf, netfilter, and wireless. Current release - regressions: - xfrm: fix XFRM_MSG_MAPPING ABI breakage caused by inserting a new value in the middle of an enum - unix: fix an issue in unix_shutdown causing the other end read/write failures - phy: mdio: fix memory leak Current release - new code bugs: - mlx5e: improve MQPRIO resiliency against bad configs Previous releases - regressions: - bpf: fix integer overflow leading to OOB access in map element pre-allocation - stmmac: dwmac-rk: fix ethernet on rk3399 based devices - netfilter: conntrack: fix boot failure with nf_conntrack.enable_hooks=1 - brcmfmac: revert using ISO3166 country code and 0 rev as fallback - i40e: fix freeing of uninitialized misc IRQ vector - iavf: fix double unlock of crit_lock Previous releases - always broken: - bpf, arm: fix register clobbering in div/mod implementation - netfilter: nf_tables: correct issues in netlink rule change event notifications - dsa: tag_dsa: fix mask for trunked packets - usb: r8152: don't resubmit rx immediately to avoid soft lockup on device unplug - i40e: fix endless loop under rtnl if FW fails to correctly respond to capability query - mlx5e: fix rx checksum offload coexistence with ipsec offload - mlx5: force round second at 1PPS out start time and allow it only in supported clock modes - phy: pcs: xpcs: fix incorrect CL37 AN sequence, EEE disable sequence Misc: - xfrm: slightly rejig the new policy uAPI to make it less cryptic" * tag 'net-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits) net: prefer socket bound to interface when not in VRF iavf: fix double unlock of crit_lock i40e: Fix freeing of uninitialized misc IRQ vector i40e: fix endless loop under rtnl dt-bindings: net: dsa: marvell: fix compatible in example ionic: move filter sync_needed bit set gve: report 64bit tx_bytes counter from gve_handle_report_stats() gve: fix gve_get_stats() rtnetlink: fix if_nlmsg_stats_size() under estimation gve: Properly handle errors in gve_assign_qpl gve: Avoid freeing NULL pointer gve: Correct available tx qpl check unix: Fix an issue in unix_shutdown causing the other end read/write failures net: stmmac: trigger PCS EEE to turn off on link down net: pcs: xpcs: fix incorrect steps on disable EEE netlink: annotate data races around nlk->bound net: pcs: xpcs: fix incorrect CL37 AN sequence net: sfp: Fix typo in state machine debug string net/sched: sch_taprio: properly cancel timer from taprio_destroy() net: bridge: fix under estimation in br_get_linkxstats_size() ...
2021-10-07Merge tag 'hyperv-fixes-signed-20211007' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Replace uuid.h with types.h in a header (Andy Shevchenko) - Avoid sleeping in atomic context in PCI driver (Long Li) - Avoid sending IPI to self when it shouldn't (Vitaly Kuznetsov) * tag 'hyperv-fixes-signed-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Avoid erroneously sending IPI to 'self' hyper-v: Replace uuid.h with types.h PCI: hv: Fix sleep while in non-sleep context when removing child devices from the bus
2021-10-07Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/David S. Miller
ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-10-07 1) Fix a sysbot reported shift-out-of-bounds in xfrm_get_default. From Pavel Skripkin. 2) Fix XFRM_MSG_MAPPING ABI breakage. The new XFRM_MSG_MAPPING messages were accidentally not paced at the end. Fix by Eugene Syromiatnikov. 3) Fix the uapi for the default policy, use explicit field and macros and make it accessible to userland. From Nicolas Dichtel. 4) Fix a missing rcu lock in xfrm_notify_userpolicy(). From Nicolas Dichtel. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-06ethtool: Add transceiver module extended stateIdo Schimmel
Add an extended state and sub-state to describe link issues related to transceiver modules. The 'ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY' extended sub-state tells user space that port is unable to gain a carrier because the CMIS Module State Machine did not reach the ModuleReady (Fully Operational) state. For example, if the module is stuck at ModuleLowPwr or ModuleFault state. In case of the latter, user space can read the fault reason from the module's EEPROM and potentially reset it. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06ethtool: Add ability to control transceiver modules' power modeIdo Schimmel
Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver modules parameters and retrieve their status. The first parameter to control is the power mode of the module. It is only relevant for paged memory modules, as flat memory modules always operate in low power mode. When a paged memory module is in low power mode, its power consumption is reduced to the minimum, the management interface towards the host is available and the data path is deactivated. User space can choose to put modules that are not currently in use in low power mode and transition them to high power mode before putting the associated ports administratively up. This is useful for user space that favors reduced power consumption and lower temperatures over reduced link up times. In QSFP-DD modules the transition from low power mode to high power mode can take a few seconds and this transition is only expected to get longer with future / more complex modules. User space can control the power mode of the module via the power mode policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible values: * high: Module is always in high power mode. * auto: Module is transitioned by the host to high power mode when the first port using it is put administratively up and to low power mode when the last port using it is put administratively down. The operational power mode of the module is available to user space via the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not reported to user space when a module is not plugged-in. The user API is designed to be generic enough so that it could be used for modules with different memory maps (e.g., SFF-8636, CMIS). The only implementation of the device driver API in this series is for a MAC driver (mlxsw) where the module is controlled by the device's firmware, but it is designed to be generic enough so that it could also be used by implementations where the module is controlled by the CPU. CMIS testing ============ # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off The module is not in low power mode, as it is not forced by hardware (LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off). The power mode can be queried from the kernel. In case LowPwrAllowRequestHW was on, the kernel would need to take into account the state of the LowPwrRequestHW signal, which is not visible to user space. $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high power-mode high Change the power mode policy to 'auto': # ethtool --set-module swp11 power-mode-policy auto Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x01 (ModuleLowPwr) LowPwrAllowRequestHW : Off LowPwrRequestSW : On Put the associated port administratively up which will instruct the host to transition the module to high power mode: # ip link set dev swp11 up Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode high Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off Put the associated port administratively down which will instruct the host to transition the module to low power mode: # ip link set dev swp11 down Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x01 (ModuleLowPwr) LowPwrAllowRequestHW : Off LowPwrRequestSW : On SFF-8636 testing ================ # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled Power set : Off Power override : On ... Transmit avg optical power (Channel 1) : 0.7733 mW / -1.12 dBm Transmit avg optical power (Channel 2) : 0.7649 mW / -1.16 dBm Transmit avg optical power (Channel 3) : 0.7790 mW / -1.08 dBm Transmit avg optical power (Channel 4) : 0.7837 mW / -1.06 dBm Rcvr signal avg optical power(Channel 1) : 0.9302 mW / -0.31 dBm Rcvr signal avg optical power(Channel 2) : 0.9079 mW / -0.42 dBm Rcvr signal avg optical power(Channel 3) : 0.8993 mW / -0.46 dBm Rcvr signal avg optical power(Channel 4) : 0.8778 mW / -0.57 dBm The module is not in low power mode, as it is not forced by hardware (Power override is on) or by software (Power set is off). The power mode can be queried from the kernel. In case Power override was off, the kernel would need to take into account the state of the LPMode signal, which is not visible to user space. $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy high power-mode high Change the power mode policy to 'auto': # ethtool --set-module swp13 power-mode-policy auto Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled Power set : On Power override : On ... Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm Put the associated port administratively up which will instruct the host to transition the module to high power mode: # ip link set dev swp13 up Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode high Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled Power set : Off Power override : On ... Transmit avg optical power (Channel 1) : 0.7934 mW / -1.01 dBm Transmit avg optical power (Channel 2) : 0.7859 mW / -1.05 dBm Transmit avg optical power (Channel 3) : 0.7885 mW / -1.03 dBm Transmit avg optical power (Channel 4) : 0.7985 mW / -0.98 dBm Rcvr signal avg optical power(Channel 1) : 0.9325 mW / -0.30 dBm Rcvr signal avg optical power(Channel 2) : 0.9034 mW / -0.44 dBm Rcvr signal avg optical power(Channel 3) : 0.9086 mW / -0.42 dBm Rcvr signal avg optical power(Channel 4) : 0.8885 mW / -0.51 dBm Put the associated port administratively down which will instruct the host to transition the module to low power mode: # ip link set dev swp13 down Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled Power set : On Power override : On ... Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06hyper-v: Replace uuid.h with types.hAndy Shevchenko
There is no user of anything in uuid.h in the hyperv.h. Replace it with more appropriate types.h. Fixes: f081bbb3fd03 ("hyper-v: Remove internal types from UAPI header") Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://lore.kernel.org/r/20211001135544.1823-1-andriy.shevchenko@linux.intel.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-10-04ipv6: ioam: Add support for the ip6ip6 encapsulationJustin Iurman
This patch adds support for the ip6ip6 encapsulation by providing three encap modes: inline, encap and auto. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-01Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-10-02 We've added 85 non-merge commits during the last 15 day(s) which contain a total of 132 files changed, 13779 insertions(+), 6724 deletions(-). The main changes are: 1) Massive update on test_bpf.ko coverage for JITs as preparatory work for an upcoming MIPS eBPF JIT, from Johan Almbladh. 2) Add a batched interface for RX buffer allocation in AF_XDP buffer pool, with driver support for i40e and ice from Magnus Karlsson. 3) Add legacy uprobe support to libbpf to complement recently merged legacy kprobe support, from Andrii Nakryiko. 4) Add bpf_trace_vprintk() as variadic printk helper, from Dave Marchevsky. 5) Support saving the register state in verifier when spilling <8byte bounded scalar to the stack, from Martin Lau. 6) Add libbpf opt-in for stricter BPF program section name handling as part of libbpf 1.0 effort, from Andrii Nakryiko. 7) Add a document to help clarifying BPF licensing, from Alexei Starovoitov. 8) Fix skel_internal.h to propagate errno if the loader indicates an internal error, from Kumar Kartikeya Dwivedi. 9) Fix build warnings with -Wcast-function-type so that the option can later be enabled by default for the kernel, from Kees Cook. 10) Fix libbpf to ignore STT_SECTION symbols in legacy map definitions as it otherwise errors out when encountering them, from Toke Høiland-Jørgensen. 11) Teach libbpf to recognize specialized maps (such as for perf RB) and internally remove BTF type IDs when creating them, from Hengqi Chen. 12) Various fixes and improvements to BPF selftests. ==================== Link: https://lore.kernel.org/r/20211002001327.15169-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-01devlink: report maximum number of snapshots with regionsJacob Keller
Each region has an independently configurable number of maximum snapshots. This information is not reported to userspace, making it not very discoverable. Fix this by adding a new DEVLINK_ATTR_REGION_MAX_SNAPSHOST attribute which is used to report this maximum. Ex: $devlink region pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1 pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10 pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1 pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10 This information enables users to understand why a new region command may fail due to having too many existing snapshots. Reported-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/phy/bcm7xxx.c d88fd1b546ff ("net: phy: bcm7xxx: Fixed indirect MMD operations") f68d08c437f9 ("net: phy: bcm7xxx: Add EPHY entry for 72165") net/sched/sch_api.c b193e15ac69d ("net: prevent user from passing illegal stab size") 69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers") Both cases trivial - adjacent code additions. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-30net: add new socket option SO_RESERVE_MEMWei Wang
This socket option provides a mechanism for users to reserve a certain amount of memory for the socket to use. When this option is set, kernel charges the user specified amount of memory to memcg, as well as sk_forward_alloc. This amount of memory is not reclaimable and is available in sk_forward_alloc for this socket. With this socket option set, the networking stack spends less cycles doing forward alloc and reclaim, which should lead to better system performance, with the cost of an amount of pre-allocated and unreclaimable memory, even under memory pressure. Note: This socket option is only available when memory cgroup is enabled and we require this reserved memory to be charged to the user's memcg. We hope this could avoid mis-behaving users to abused this feature to reserve a large amount on certain sockets and cause unfairness for others. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-29Merge tag 'sound-5.15-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "This became a slightly large collection of changes, partly because I've been off in the last weeks. Most of changes are small and scattered while a bit big change is found in HD-audio Realtek codec driver; it's a very device-specific fix that has been long wanted, so I decided to pick up although it's in the middle RC. Some highlights: - A new guard ioctl for ALSA rawmidi API to avoid the misuse of the new timestamp framing mode; it's for a regression fix - HD-audio: a revert of the 5.15 change that might work badly, new quirks for Lenovo Legion & co, a follow-up fix for CS8409 - ASoC: lots of SOF-related fixes, fsl component fixes, corrections of mediatek drivers - USB-audio: fix for the PM resume - FireWire: oxfw and motu fixes" * tag 'sound-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (25 commits) ALSA: pcsp: Make hrtimer forwarding more robust ALSA: rawmidi: introduce SNDRV_RAWMIDI_IOCTL_USER_PVERSION ALSA: firewire-motu: fix truncated bytes in message tracepoints ASoC: SOF: trace: Omit error print when waking up trace sleepers ASoC: mediatek: mt8195: remove wrong fixup assignment on HDMITX ASoC: SOF: loader: Re-phrase the missing firmware error to avoid duplication ASoC: SOF: loader: release_firmware() on load failure to avoid batching ALSA: hda/cs8409: Setup Dolphin Headset Mic as Phantom Jack ALSA: pcxhr: "fix" PCXHR_REG_TO_PORT definition ASoC: SOF: imx: imx8m: Bar index is only valid for IRAM and SRAM types ASoC: SOF: imx: imx8: Bar index is only valid for IRAM and SRAM types ASoC: SOF: Fix DSP oops stack dump output contents ALSA: hda/realtek: Quirks to enable speaker output for Lenovo Legion 7i 15IMHG05, Yoga 7i 14ITL5/15ITL5, and 13s Gen2 laptops. ALSA: usb-audio: Unify mixer resume and reset_resume procedure Revert "ALSA: hda: Drop workaround for a hang at shutdown again" ALSA: oxfw: fix transmission method for Loud models based on OXFW971 ASoC: mediatek: common: handle NULL case in suspend/resume function ASoC: fsl_xcvr: register platform component before registering cpu dai ASoC: fsl_spdif: register platform component before registering cpu dai ASoC: fsl_micfil: register platform component before registering cpu dai ...
2021-09-25Merge tag 'char-misc-5.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char and misc driver fixes for 5.15-rc3. Nothing huge in here, just fixes for a number of small issues that have been reported. These include: - habanalabs race conditions and other bugs fixed - binder driver fixes - fpga driver fixes - coresight build warning fix - nvmem driver fix - comedi memory leak fix - bcm-vk tty race fix - other tiny driver fixes All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits) comedi: Fix memory leak in compat_insnlist() nvmem: NVMEM_NINTENDO_OTP should depend on WII misc: bcm-vk: fix tty registration race fpga: dfl: Avoid reads to AFU CSRs during enumeration fpga: machxo2-spi: Fix missing error code in machxo2_write_complete() fpga: machxo2-spi: Return an error on failure habanalabs: expose a single cs seq in staged submissions habanalabs: fix wait offset handling habanalabs: rate limit multi CS completion errors habanalabs/gaudi: fix LBW RR configuration habanalabs: Fix spelling mistake "FEADBACK" -> "FEEDBACK" habanalabs: fail collective wait when not supported habanalabs/gaudi: use direct MSI in single mode habanalabs: fix kernel OOPs related to staged cs habanalabs: fix potential race in interrupt wait ioctl mcb: fix error handling in mcb_alloc_bus() misc: genwqe: Fixes DMA mask setting coresight: syscfg: Fix compiler warning nvmem: core: Add stubs for nvmem_cell_read_variable_le_u32/64 if !CONFIG_NVMEM binder: make sure fd closes complete ...
2021-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/mptcp/protocol.c 977d293e23b4 ("mptcp: ensure tx skbs always have the MPTCP ext") efe686ffce01 ("mptcp: ensure tx skbs always have the MPTCP ext") same patch merged in both trees, keep net-next. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-23ALSA: rawmidi: introduce SNDRV_RAWMIDI_IOCTL_USER_PVERSIONJaroslav Kysela
The new framing mode causes the user space regression, because the alsa-lib code does not initialize the reserved space in the params structure when the device is opened. This change adds SNDRV_RAWMIDI_IOCTL_USER_PVERSION like we do for the PCM interface for the protocol acknowledgment. Cc: David Henningsson <coding@diwic.se> Cc: <stable@vger.kernel.org> Fixes: 08fdced60ca0 ("ALSA: rawmidi: Add framing mode") BugLink: https://github.com/alsa-project/alsa-lib/issues/178 Signed-off-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20210920171850.154186-1-perex@perex.cz Signed-off-by: Takashi Iwai <tiwai@suse.de>
2021-09-20Merge tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs client fixes from Steve French: - two deferred close fixes (for bugs found with xfstests 478 and 461) - a deferred close improvement in rename - two trivial fixes for incorrect Linux comment formatting of multiple cifs files (pointed out by automated kernel test robot and checkpatch) * tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: Not to defer close on file when lock is set cifs: Fix soft lockup during fsstress cifs: Deferred close performance improvements cifs: fix incorrect kernel doc comments cifs: remove pathname for file from SPDX header
2021-09-18mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt supportFlorian Westphal
This retrieves the address pairs of all subflows currently active for a given mptcp connection. It re-uses the same meta-header as for MPTCP_TCPINFO. A new structure is provided to hold the subflow address data: struct mptcp_subflow_addrs { union { __kernel_sa_family_t sa_family; struct sockaddr sa_local; struct sockaddr_in sin_local; struct sockaddr_in6 sin6_local; struct sockaddr_storage ss_local; }; union { struct sockaddr sa_remote; struct sockaddr_in sin_remote; struct sockaddr_in6 sin6_remote; struct sockaddr_storage ss_remote; }; }; Usage of the new getsockopt is very similar to MPTCP_TCPINFO one. Userspace allocates a 'struct mptcp_subflow_data', followed by one or more 'struct mptcp_subflow_addrs', then inits the mptcp_subflow_data structure as follows: struct mptcp_subflow_addrs *sf_addr; struct mptcp_subflow_data *addr; socklen_t olen = sizeof(*addr) + (8 * sizeof(*sf_addr)); addr = malloc(olen); addr->size_subflow_data = sizeof(*addr); addr->num_subflows = 0; addr->size_kernel = 0; addr->size_user = sizeof(struct mptcp_subflow_addrs); sf_addr = (struct mptcp_subflow_addrs *)(addr + 1); and then retrieves the endpoint addresses via: ret = getsockopt(fd, SOL_MPTCP, MPTCP_SUBFLOW_ADDRS, addr, &olen); If the call succeeds, kernel will have added up to 8 endpoint addresses after the 'mptcp_subflow_data' header. Userspace needs to re-check 'olen' value to detect how many bytes have been filled in by the kernel. Userspace can check addr->num_subflows to discover when there were more subflows that available data space. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-18mptcp: add MPTCP_TCPINFO getsockopt supportFlorian Westphal
Allow users to retrieve TCP_INFO data of all subflows. Users need to pre-initialize a meta header that has to be prepended to the data buffer that will be filled with the tcp info data. The meta header looks like this: struct mptcp_subflow_data { __u32 size_subflow_data;/* size of this structure in userspace */ __u32 num_subflows; /* must be 0, set by kernel */ __u32 size_kernel; /* must be 0, set by kernel */ __u32 size_user; /* size of one element in data[] */ } __attribute__((aligned(8))); size_subflow_data has to be set to 'sizeof(struct mptcp_subflow_data)'. This allows to extend mptcp_subflow_data structure later on without breaking backwards compatibility. If the structure is extended later on, kernel knows where the userspace-provided meta header ends, even if userspace uses an older (smaller) version of the structure. num_subflows must be set to 0. If the getsockopt request succeeds (return value is 0), it will be updated to contain the number of active subflows for the given logical connection. size_kernel must be set to 0. If the getsockopt request is successful, it will contain the size of the 'struct tcp_info' as known by the kernel. This is informational only. size_user must be set to 'sizeof(struct tcp_info)'. This allows the kernel to only fill in the space reserved/expected by userspace. Example: struct my_tcp_info { struct mptcp_subflow_data d; struct tcp_info ti[2]; }; struct my_tcp_info ti; socklen_t olen; memset(&ti, 0, sizeof(ti)); ti.d.size_subflow_data = sizeof(struct mptcp_subflow_data); ti.d.size_user = sizeof(struct tcp_info); olen = sizeof(ti); ret = getsockopt(fd, SOL_MPTCP, MPTCP_TCPINFO, &ti, &olen); if (ret < 0) die_perror("getsockopt MPTCP_TCPINFO"); mptcp_subflow_data.num_subflows is populated with the number of subflows that exist on the kernel side for the logical mptcp connection. This allows userspace to re-try with a larger tcp_info array if the number of subflows was larger than the available space in the ti[] array. olen has to be set to the number of bytes that userspace has allocated to receive the kernel data. It will be updated to contain the real number bytes that have been copied to by the kernel. In the above example, if the number if subflows was 1, olen is equal to 'sizeof(struct mptcp_subflow_data) + sizeof(struct tcp_info). For 2 or more subflows olen is equal to 'sizeof(struct my_tcp_info)'. If there was more data that could not be copied due to lack of space in the option buffer, userspace can detect this by checking mptcp_subflow_data->num_subflows. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-18mptcp: add MPTCP_INFO getsockoptFlorian Westphal
Its not compatible with multipath-tcp.org kernel one. 1. The out-of-tree implementation defines a different 'struct mptcp_info', with embedded __user addresses for additional data such as endpoint addresses. 2. Mat Martineau points out that embedded __user addresses doesn't work with BPF_CGROUP_RUN_PROG_GETSOCKOPT() which assumes that copying in optsize bytes from optval provides all data that got copied to userspace. This provides mptcp_info data for the given mptcp socket. Userspace sets optlen to the size of the structure it expects. The kernel updates it to contain the number of bytes that it copied. This allows to append more information to the structure later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17bpf: Clarify data_len param in bpf_snprintf and bpf_seq_printf commentsDave Marchevsky
Since the data_len in these two functions is a byte len of the preceding u64 *data array, it must always be a multiple of 8. If this isn't the case both helpers error out, so let's make the requirement explicit so users don't need to infer it. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210917182911.2426606-10-davemarchevsky@fb.com
2021-09-17bpf: Add bpf_trace_vprintk helperDave Marchevsky
This helper is meant to be "bpf_trace_printk, but with proper vararg support". Follow bpf_snprintf's example and take a u64 pseudo-vararg array. Write to /sys/kernel/debug/tracing/trace_pipe using the same mechanism as bpf_trace_printk. The functionality of this helper was requested in the libbpf issue tracker [0]. [0] Closes: https://github.com/libbpf/libbpf/issues/315 Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210917182911.2426606-4-davemarchevsky@fb.com
2021-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-09-17 We've added 63 non-merge commits during the last 12 day(s) which contain a total of 65 files changed, 2653 insertions(+), 751 deletions(-). The main changes are: 1) Streamline internal BPF program sections handling and bpf_program__set_attach_target() in libbpf, from Andrii. 2) Add support for new btf kind BTF_KIND_TAG, from Yonghong. 3) Introduce bpf_get_branch_snapshot() to capture LBR, from Song. 4) IMUL optimization for x86-64 JIT, from Jie. 5) xsk selftest improvements, from Magnus. 6) Introduce legacy kprobe events support in libbpf, from Rafael. 7) Access hw timestamp through BPF's __sk_buff, from Vadim. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits) selftests/bpf: Fix a few compiler warnings libbpf: Constify all high-level program attach APIs libbpf: Schedule open_opts.attach_prog_fd deprecation since v0.7 selftests/bpf: Switch fexit_bpf2bpf selftest to set_attach_target() API libbpf: Allow skipping attach_func_name in bpf_program__set_attach_target() libbpf: Deprecated bpf_object_open_opts.relaxed_core_relocs selftests/bpf: Stop using relaxed_core_relocs which has no effect libbpf: Use pre-setup sec_def in libbpf_find_attach_btf_id() bpf: Update bpf_get_smp_processor_id() documentation libbpf: Add sphinx code documentation comments selftests/bpf: Skip btf_tag test if btf_tag attribute not supported docs/bpf: Add documentation for BTF_KIND_TAG selftests/bpf: Add a test with a bpf program with btf_tag attributes selftests/bpf: Test BTF_KIND_TAG for deduplication selftests/bpf: Add BTF_KIND_TAG unit tests selftests/bpf: Change NAME_NTH/IS_NAME_NTH for BTF_KIND_TAG format selftests/bpf: Test libbpf API function btf__add_tag() bpftool: Add support for BTF_KIND_TAG libbpf: Add support for BTF_KIND_TAG libbpf: Rename btf_{hash,equal}_int to btf_{hash,equal}_int_tag ... ==================== Link: https://lore.kernel.org/r/20210917173738.3397064-1-ast@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts! Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-16net/tls: support SM4 GCM/CCM algorithmTianjia Zhang
The RFC8998 specification defines the use of the ShangMi algorithm cipher suites in TLS 1.3, and also supports the GCM/CCM mode using the SM4 algorithm. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-15bpf: Update bpf_get_smp_processor_id() documentationMatteo Croce
BPF programs run with migration disabled regardless of preemption, as they are protected by migrate_disable(). Update the uapi documentation accordingly. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210914235400.59427-1-mcroce@linux.microsoft.com
2021-09-15xfrm: make user policy API completeNicolas Dichtel
>From a userland POV, this API was based on some magic values: - dirmask and action were bitfields but meaning of bits (XFRM_POL_DEFAULT_*) are not exported; - action is confusing, if a bit is set, does it mean drop or accept? Let's try to simplify this uapi by using explicit field and macros. Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-09-14bpf: Support for new btf kind BTF_KIND_TAGYonghong Song
LLVM14 added support for a new C attribute ([1]) __attribute__((btf_tag("arbitrary_str"))) This attribute will be emitted to dwarf ([2]) and pahole will convert it to BTF. Or for bpf target, this attribute will be emitted to BTF directly ([3], [4]). The attribute is intended to provide additional information for - struct/union type or struct/union member - static/global variables - static/global function or function parameter. For linux kernel, the btf_tag can be applied in various places to specify user pointer, function pre- or post- condition, function allow/deny in certain context, etc. Such information will be encoded in vmlinux BTF and can be used by verifier. The btf_tag can also be applied to bpf programs to help global verifiable functions, e.g., specifying preconditions, etc. This patch added basic parsing and checking support in kernel for new BTF_KIND_TAG kind. [1] https://reviews.llvm.org/D106614 [2] https://reviews.llvm.org/D106621 [3] https://reviews.llvm.org/D106622 [4] https://reviews.llvm.org/D109560 Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210914223015.245546-1-yhs@fb.com
2021-09-14btf: Change BTF_KIND_* macros to enumsYonghong Song
Change BTF_KIND_* macros to enums so they are encoded in dwarf and appear in vmlinux.h. This will make it easier for bpf programs to use these constants without macro definitions. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210914223009.245307-1-yhs@fb.com
2021-09-14net/smc: add generic netlink support for system EIDKarsten Graul
With SMC-Dv2 users can configure if the static system EID should be used during CLC handshake, or if only user EIDs are allowed. Add generic netlink support to enable and disable the system EID, and to retrieve the system EID and its current enabled state. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-14net/smc: add support for user defined EIDsKarsten Graul
SMC-Dv2 allows users to define EIDs which allows to create separate name spaces enabling users to cluster their SMC-Dv2 connections. Add support for user defined EIDs and extent the generic netlink interface so users can add, remove and dump EIDs. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-14include/uapi/linux/xfrm.h: Fix XFRM_MSG_MAPPING ABI breakageEugene Syromiatnikov
Commit 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") broke ABI by changing the value of the XFRM_MSG_MAPPING enum item, thus also evading the build-time check in security/selinux/nlmsgtab.c:selinux_nlmsg_lookup for presence of proper security permission checks in nlmsg_xfrm_perms. Fix it by placing XFRM_MSG_SETDEFAULT/XFRM_MSG_GETDEFAULT to the end of the enum, right before __XFRM_MSG_MAX, and updating the nlmsg_xfrm_perms accordingly. Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") References: https://lore.kernel.org/netdev/20210901151402.GA2557@altlinux.org/ Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Acked-by: Antony Antony <antony.antony@secunet.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-09-14binder: fix freeze raceLi Li
Currently cgroup freezer is used to freeze the application threads, and BINDER_FREEZE is used to freeze the corresponding binder interface. There's already a mechanism in ioctl(BINDER_FREEZE) to wait for any existing transactions to drain out before actually freezing the binder interface. But freezing an app requires 2 steps, freezing the binder interface with ioctl(BINDER_FREEZE) and then freezing the application main threads with cgroupfs. This is not an atomic operation. The following race issue might happen. 1) Binder interface is frozen by ioctl(BINDER_FREEZE); 2) Main thread A initiates a new sync binder transaction to process B; 3) Main thread A is frozen by "echo 1 > cgroup.freeze"; 4) The response from process B reaches the frozen thread, which will unexpectedly fail. This patch provides a mechanism to check if there's any new pending transaction happening between ioctl(BINDER_FREEZE) and freezing the main thread. If there's any, the main thread freezing operation can be rolled back to finish the pending transaction. Furthermore, the response might reach the binder driver before the rollback actually happens. That will still cause failed transaction. As the other process doesn't wait for another response of the response, the response transaction failure can be fixed by treating the response transaction like an oneway/async one, allowing it to reach the frozen thread. And it will be consumed when the thread gets unfrozen later. NOTE: This patch reuses the existing definition of struct binder_frozen_status_info but expands the bit assignments of __u32 member sync_recv. To ensure backward compatibility, bit 0 of sync_recv still indicates there's an outstanding sync binder transaction. This patch adds new information to bit 1 of sync_recv, indicating the binder transaction happens exactly when there's a race. If an existing userspace app runs on a new kernel, a sync binder call will set bit 0 of sync_recv so ioctl(BINDER_GET_FROZEN_INFO) still return the expected value (true). The app just doesn't check bit 1 intentionally so it doesn't have the ability to tell if there's a race. This behavior is aligned with what happens on an old kernel which doesn't set bit 1 at all. A new userspace app can 1) check bit 0 to know if there's a sync binder transaction happened when being frozen - same as before; and 2) check bit 1 to know if that sync binder transaction happened exactly when there's a race - a new information for rollback decision. the same time, confirmed the pending transactions succeeded. Fixes: 432ff1e91694 ("binder: BINDER_FREEZE ioctl") Acked-by: Todd Kjos <tkjos@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Li Li <dualli@google.com> Test: stress test with apps being frozen and initiating binder calls at Link: https://lore.kernel.org/r/20210910164210.2282716-2-dualli@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-13cifs: remove pathname for file from SPDX headerSteve French
checkpatch complains about source files with filenames (e.g. in these cases just below the SPDX header in comments at the top of various files in fs/cifs). It also is helpful to change this now so will be less confusing when the parent directory is renamed e.g. from fs/cifs to fs/smb_client (or fs/smbfs) Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-13bpf: Introduce helper bpf_get_branch_snapshotSong Liu
Introduce bpf_get_branch_snapshot(), which allows tracing pogram to get branch trace from hardware (e.g. Intel LBR). To use the feature, the user need to create perf_event with proper branch_record filtering on each cpu, and then calls bpf_get_branch_snapshot in the bpf function. On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210910183352.3151445-3-songliubraving@fb.com
2021-09-13io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg itemsEugene Syromiatnikov
The items passed in the array pointed by the arg parameter of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation carry certain semantics: they refer to different io-wq worker categories; provide IO_WQ_* constants in the UAPI, so these categories can be referenced in the user space code. Suggested-by: Jens Axboe <axboe@kernel.dk> Complements: 2e480058ddc21ec5 ("io-wq: provide a way to limit max number of workers") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lore.kernel.org/r/20210913154415.GA12890@asgard.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - vduse driver ("vDPA Device in Userspace") supporting emulated virtio block devices - virtio-vsock support for end of record with SEQPACKET - vdpa: mac and mq support for ifcvf and mlx5 - vdpa: management netlink for ifcvf - virtio-i2c, gpio dt bindings - misc fixes and cleanups * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits) Documentation: Add documentation for VDUSE vduse: Introduce VDUSE - vDPA Device in Userspace vduse: Implement an MMU-based software IOTLB vdpa: Support transferring virtual addressing during DMA mapping vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() vhost-iotlb: Add an opaque pointer for vhost IOTLB vhost-vdpa: Handle the failure of vdpa_reset() vdpa: Add reset callback in vdpa_config_ops vdpa: Fix some coding style issues file: Export receive_fd() to modules eventfd: Export eventfd_wake_count to modules iova: Export alloc_iova_fast() and free_iova_fast() virtio-blk: remove unneeded "likely" statements virtio-balloon: Use virtio_find_vqs() helper vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro vsock_test: update message bounds test for MSG_EOR af_vsock: rename variables in receive loop virtio/vsock: support MSG_EOR bit processing vhost/vsock: support MSG_EOR bit processing ...
2021-09-10bpf: Add hardware timestamp field to __sk_buffVadim Fedorenko
BPF programs may want to know hardware timestamps if NIC supports such timestamping. Expose this data as hwtstamp field of __sk_buff the same way as gso_segs/gso_size. This field could be accessed from the same programs as tstamp field, but it's read-only field. Explicit test to deny access to padding data is added to bpf_skb_is_valid_access. Also update BPF_PROG_TEST_RUN tests of the feature. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210909220409.8804-2-vfedorenko@novek.ru
2021-09-10Merge tag 'char-misc-5.15-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull habanalabs updates from Greg KH: "Here is another round of misc driver patches for 5.15-rc1. In here is only updates for the Habanalabs driver. This request is late because the previously-objected-to dma-buf patches are all removed and some fixes that you and others found are now included in here as well. All of these have been in linux-next for well over a week with no reports of problems, and they are all self-contained to only this one driver. Full details are in the shortlog" * tag 'char-misc-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (61 commits) habanalabs/gaudi: hwmon default card name habanalabs: add support for f/w reset habanalabs/gaudi: block ICACHE_BASE_ADDERESS_HIGH in TPC habanalabs: cannot sleep while holding spinlock habanalabs: never copy_from_user inside spinlock habanalabs: remove unnecessary device status check habanalabs: disable IRQ in user interrupts spinlock habanalabs: add "in device creation" status habanalabs/gaudi: invalidate PMMU mem cache on init habanalabs/gaudi: size should be printed in decimal habanalabs/gaudi: define DC POWER for secured PMC habanalabs/gaudi: unmask out of bounds SLM access interrupt habanalabs: add userptr_lookup node in debugfs habanalabs/gaudi: fetch TPC/MME ECC errors from F/W habanalabs: modify multi-CS to wait on stream masters habanalabs/gaudi: add monitored SOBs to state dump habanalabs/gaudi: restore user registers when context opens habanalabs/gaudi: increase boot fit timeout habanalabs: update to latest firmware headers habanalabs/gaudi: minimize number of register reads ...
2021-09-09Merge tag 'for-linus-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for VMAP_STACK - Support for splice_write in hostfs - Fixes for virt-pci - Fixes for virtio_uml - Various fixes * tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: fix stub location calculation um: virt-pci: fix uapi documentation um: enable VMAP_STACK um: virt-pci: don't do DMA from stack hostfs: support splice_write um: virtio_uml: fix memory leak on init failures um: virtio_uml: include linux/virtio-uml.h lib/logic_iomem: fix sparse warnings um: make PCI emulation driver init/exit static
2021-09-09Merge tag 'cxl-for-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL (Compute Express Link) updates from Dan Williams: - Fix detection of CXL host bridges to filter out disabled ACPI0016 devices in the ACPI DSDT. - Fix kernel lockdown integration to disable raw commands when raw PCI access is disabled. - Fix a broken debug message. - Add support for "Get Partition Info". I.e. enumerate the split between volatile and persistent capacity on bi-modal CXL memory expanders. - Re-factor the core by subject area. This is a work in progress. - Prepare libnvdimm to understand CXL labels in addition to EFI labels. This is a work in progress. * tag 'cxl-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (25 commits) cxl/registers: Fix Documentation warning cxl/pmem: Fix Documentation warning cxl/uapi: Fix defined but not used warnings cxl/pci: Fix debug message in cxl_probe_regs() cxl/pci: Fix lockdown level cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports libnvdimm/labels: Add claim class helpers libnvdimm/labels: Add type-guid helpers libnvdimm/labels: Add blk special cases for nlabel and position helpers libnvdimm/labels: Add blk isetcookie set / validation helpers libnvdimm/labels: Add a checksum calculation helper libnvdimm/labels: Introduce label setter helpers libnvdimm/labels: Add isetcookie validation helper libnvdimm/labels: Introduce getters for namespace label fields cxl/mem: Adjust ram/pmem range to represent DPA ranges cxl/mem: Account for partitionable space in ram/pmem ranges cxl/pci: Store memory capacity values cxl/pci: Simplify register setup cxl/pci: Ignore unknown register block types cxl/core: Move memdev management to core ...
2021-09-09Merge tag 'dmaengine-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine updates from Vinod Koul: "New drivers/devices - Support for Renesas RZ/G2L dma controller - New driver for AMD PTDMA controller Updates: - Big pile of idxd updates - Updates for Altera driver, stm32-dma, dw etc" * tag 'dmaengine-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (83 commits) dmaengine: sh: fix some NULL dereferences dmaengine: sh: Fix unused initialization of pointer lmdesc MAINTAINERS: Fix AMD PTDMA DRIVER entry dmaengine: ptdma: remove PT_OFFSET to avoid redefnition dmaengine: ptdma: Add debugfs entries for PTDMA dmaengine: ptdma: register PTDMA controller as a DMA resource dmaengine: ptdma: Initial driver for the AMD PTDMA dmaengine: fsl-dpaa2-qdma: Fix spelling mistake "faile" -> "failed" dmaengine: idxd: remove interrupt disable for dev_lock dmaengine: idxd: remove interrupt disable for cmd_lock dmaengine: idxd: fix setting up priv mode for dwq dmaengine: xilinx_dma: Set DMA mask for coherent APIs dmaengine: ti: k3-psil-j721e: Add entry for CSI2RX dmaengine: sh: Add DMAC driver for RZ/G2L SoC dmaengine: Extend the dma_slave_width for 128 bytes dt-bindings: dma: Document RZ/G2L bindings dmaengine: ioat: depends on !UML dmaengine: idxd: set descriptor allocation size to threshold for swq dmaengine: idxd: make submit failure path consistent on desc freeing dmaengine: idxd: remove interrupt flag for completion list spinlock ...