summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet
AgeCommit message (Collapse)Author
5 daysnet: mvpp2: guard flow control update with global_tx_fc in buffer switchingMuhammad Hammad Ijaz
mvpp2_bm_switch_buffers() unconditionally calls mvpp2_bm_pool_update_priv_fc() when switching between per-cpu and shared buffer pool modes. This function programs CM3 flow control registers via mvpp2_cm3_read()/mvpp2_cm3_write(), which dereference priv->cm3_base without any NULL check. When the CM3 SRAM resource is not present in the device tree (the third reg entry added by commit 60523583b07c ("dts: marvell: add CM3 SRAM memory to cp11x ethernet device tree")), priv->cm3_base remains NULL and priv->global_tx_fc is false. Any operation that triggers mvpp2_bm_switch_buffers(), for example an MTU change that crosses the jumbo frame threshold, will crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Mem abort info: ESR = 0x0000000096000006 EC = 0x25: DABT (current EL), IL = 32 bits pc : readl+0x0/0x18 lr : mvpp2_cm3_read.isra.0+0x14/0x20 Call trace: readl+0x0/0x18 mvpp2_bm_pool_update_fc+0x40/0x12c mvpp2_bm_pool_update_priv_fc+0x94/0xd8 mvpp2_bm_switch_buffers.isra.0+0x80/0x1c0 mvpp2_change_mtu+0x140/0x380 __dev_set_mtu+0x1c/0x38 dev_set_mtu_ext+0x78/0x118 dev_set_mtu+0x48/0xa8 dev_ifsioc+0x21c/0x43c dev_ioctl+0x2d8/0x42c sock_ioctl+0x314/0x378 Every other flow control call site in the driver already guards hardware access with either priv->global_tx_fc or port->tx_fc. mvpp2_bm_switch_buffers() is the only place that omits this check. Add the missing priv->global_tx_fc guard to both the disable and re-enable calls in mvpp2_bm_switch_buffers(), consistent with the rest of the driver. Fixes: 3a616b92a9d1 ("net: mvpp2: Add TX flow control support for jumbo frames") Signed-off-by: Muhammad Hammad Ijaz <mhijaz@amazon.com> Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com> Link: https://patch.msgid.link/20260316193157.65748-1-mhijaz@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 daysnet/mlx5e: Fix race condition during IPSec ESN updateJianbo Liu
In IPSec full offload mode, the device reports an ESN (Extended Sequence Number) wrap event to the driver. The driver validates this event by querying the IPSec ASO and checking that the esn_event_arm field is 0x0, which indicates an event has occurred. After handling the event, the driver must re-arm the context by setting esn_event_arm back to 0x1. A race condition exists in this handling path. After validating the event, the driver calls mlx5_accel_esp_modify_xfrm() to update the kernel's xfrm state. This function temporarily releases and re-acquires the xfrm state lock. So, need to acknowledge the event first by setting esn_event_arm to 0x1. This prevents the driver from reprocessing the same ESN update if the hardware sends events for other reason. Since the next ESN update only occurs after nearly 2^31 packets are received, there's no risk of missing an update, as it will happen long after this handling has finished. Processing the event twice causes the ESN high-order bits (esn_msb) to be incremented incorrectly. The driver then programs the hardware with this invalid ESN state, which leads to anti-replay failures and a complete halt of IPSec traffic. Fix this by re-arming the ESN event immediately after it is validated, before calling mlx5_accel_esp_modify_xfrm(). This ensures that any spurious, duplicate events are correctly ignored, closing the race window. Fixes: fef06678931f ("net/mlx5e: Fix ESN update kernel panic") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260316094603.6999-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet/mlx5e: Prevent concurrent access to IPSec ASO contextJianbo Liu
The query or updating IPSec offload object is through Access ASO WQE. The driver uses a single mlx5e_ipsec_aso struct for each PF, which contains a shared DMA-mapped context for all ASO operations. A race condition exists because the ASO spinlock is released before the hardware has finished processing WQE. If a second operation is initiated immediately after, it overwrites the shared context in the DMA area. When the first operation's completion is processed later, it reads this corrupted context, leading to unexpected behavior and incorrect results. This commit fixes the race by introducing a private context within each IPSec offload object. The shared ASO context is now copied to this private context while the ASO spinlock is held. Subsequent processing uses this saved, per-object context, ensuring its integrity is maintained. Fixes: 1ed78fc03307 ("net/mlx5e: Update IPsec soft and hard limits") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260316094603.6999-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet/mlx5: qos: Restrict RTNL area to avoid a lock cycleCosmin Ratiu
A lock dependency cycle exists where: 1. mlx5_ib_roce_init -> mlx5_core_uplink_netdev_event_replay -> mlx5_blocking_notifier_call_chain (takes notifier_rwsem) -> mlx5e_mdev_notifier_event -> mlx5_netdev_notifier_register -> register_netdevice_notifier_dev_net (takes rtnl) => notifier_rwsem -> rtnl 2. mlx5e_probe -> _mlx5e_probe -> mlx5_core_uplink_netdev_set (takes uplink_netdev_lock) -> mlx5_blocking_notifier_call_chain (takes notifier_rwsem) => uplink_netdev_lock -> notifier_rwsem 3: devlink_nl_rate_set_doit -> devlink_nl_rate_set -> mlx5_esw_devlink_rate_leaf_tx_max_set -> esw_qos_devlink_rate_to_mbps -> mlx5_esw_qos_max_link_speed_get (takes rtnl) -> mlx5_esw_qos_lag_link_speed_get_locked -> mlx5_uplink_netdev_get (takes uplink_netdev_lock) => rtnl -> uplink_netdev_lock => BOOM! (lock cycle) Fix that by restricting the rtnl-protected section to just the necessary part, the call to netdev_master_upper_dev_get and speed querying, so that the last lock dependency is avoided and the cycle doesn't close. This is safe because mlx5_uplink_netdev_get uses netdev_hold to keep the uplink netdev alive while its master device is queried. Use this opportunity to rename the ambiguously-named "hold_rtnl_lock" argument to "take_rtnl" and remove the "_locked" suffix from mlx5_esw_qos_lag_link_speed_get_locked. Fixes: 6b4be64fd9fe ("net/mlx5e: Harden uplink netdev access against device unbind") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260316094603.6999-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysMerge branch '1GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2026-03-17 (igc, iavf, libie) Kohei Enju adds use of helper function to add missing update of skb->tail when padding is needed for igc. Zdenek Bouska clears stale XSK timestamps when taking down Tx rings on igc. Petr Oros changes handling of iavf VLAN filter handling when an added VLAN is also on the delete list to which can race and cause the VLAN filter to not be added. Michal frees cmd_buf for libie firmware logging to stop memory leaks. * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: libie: prevent memleak in fwlog code iavf: fix VLAN filter lost on add/delete race igc: fix page fault in XDP TX timestamps handling igc: fix missing update of skb->tail in igc_xmit_frame() ==================== Link: https://patch.msgid.link/20260317211906.115505-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: macb: fix uninitialized rx_fs_lockFedor Pchelkin
If hardware doesn't support RX Flow Filters, rx_fs_lock spinlock is not initialized leading to the following assertion splat triggerable via set_rxnfc callback. INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. CPU: 1 PID: 949 Comm: syz.0.6 Not tainted 6.1.164+ #113 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xba lib/dump_stack.c:106 assign_lock_key kernel/locking/lockdep.c:974 [inline] register_lock_class+0x141b/0x17f0 kernel/locking/lockdep.c:1287 __lock_acquire+0x74f/0x6c40 kernel/locking/lockdep.c:4928 lock_acquire kernel/locking/lockdep.c:5662 [inline] lock_acquire+0x190/0x4b0 kernel/locking/lockdep.c:5627 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x33/0x50 kernel/locking/spinlock.c:162 gem_del_flow_filter drivers/net/ethernet/cadence/macb_main.c:3562 [inline] gem_set_rxnfc+0x533/0xac0 drivers/net/ethernet/cadence/macb_main.c:3667 ethtool_set_rxnfc+0x18c/0x280 net/ethtool/ioctl.c:961 __dev_ethtool net/ethtool/ioctl.c:2956 [inline] dev_ethtool+0x229c/0x6290 net/ethtool/ioctl.c:3095 dev_ioctl+0x637/0x1070 net/core/dev_ioctl.c:510 sock_do_ioctl+0x20d/0x2c0 net/socket.c:1215 sock_ioctl+0x577/0x6d0 net/socket.c:1320 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x18c/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:46 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 A more straightforward solution would be to always initialize rx_fs_lock, just like rx_fs_list. However, in this case the driver set_rxnfc callback would return with a rather confusing error code, e.g. -EINVAL. So deny set_rxnfc attempts directly if the RX filtering feature is not supported by hardware. Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Link: https://patch.msgid.link/20260316103826.74506-2-pchelkin@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: macb: fix use-after-free access to PTP clockFedor Pchelkin
PTP clock is registered on every opening of the interface and destroyed on every closing. However it may be accessed via get_ts_info ethtool call which is possible while the interface is just present in the kernel. BUG: KASAN: use-after-free in ptp_clock_index+0x47/0x50 drivers/ptp/ptp_clock.c:426 Read of size 4 at addr ffff8880194345cc by task syz.0.6/948 CPU: 1 PID: 948 Comm: syz.0.6 Not tainted 6.1.164+ #109 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xba lib/dump_stack.c:106 print_address_description mm/kasan/report.c:316 [inline] print_report+0x17f/0x496 mm/kasan/report.c:420 kasan_report+0xd9/0x180 mm/kasan/report.c:524 ptp_clock_index+0x47/0x50 drivers/ptp/ptp_clock.c:426 gem_get_ts_info+0x138/0x1e0 drivers/net/ethernet/cadence/macb_main.c:3349 macb_get_ts_info+0x68/0xb0 drivers/net/ethernet/cadence/macb_main.c:3371 __ethtool_get_ts_info+0x17c/0x260 net/ethtool/common.c:558 ethtool_get_ts_info net/ethtool/ioctl.c:2367 [inline] __dev_ethtool net/ethtool/ioctl.c:3017 [inline] dev_ethtool+0x2b05/0x6290 net/ethtool/ioctl.c:3095 dev_ioctl+0x637/0x1070 net/core/dev_ioctl.c:510 sock_do_ioctl+0x20d/0x2c0 net/socket.c:1215 sock_ioctl+0x577/0x6d0 net/socket.c:1320 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x18c/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:46 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 </TASK> Allocated by task 457: kmalloc include/linux/slab.h:563 [inline] kzalloc include/linux/slab.h:699 [inline] ptp_clock_register+0x144/0x10e0 drivers/ptp/ptp_clock.c:235 gem_ptp_init+0x46f/0x930 drivers/net/ethernet/cadence/macb_ptp.c:375 macb_open+0x901/0xd10 drivers/net/ethernet/cadence/macb_main.c:2920 __dev_open+0x2ce/0x500 net/core/dev.c:1501 __dev_change_flags+0x56a/0x740 net/core/dev.c:8651 dev_change_flags+0x92/0x170 net/core/dev.c:8722 do_setlink+0xaf8/0x3a80 net/core/rtnetlink.c:2833 __rtnl_newlink+0xbf4/0x1940 net/core/rtnetlink.c:3608 rtnl_newlink+0x63/0xa0 net/core/rtnetlink.c:3655 rtnetlink_rcv_msg+0x3c6/0xed0 net/core/rtnetlink.c:6150 netlink_rcv_skb+0x15d/0x430 net/netlink/af_netlink.c:2511 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x6d7/0xa30 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x97e/0xeb0 net/netlink/af_netlink.c:1872 sock_sendmsg_nosec net/socket.c:718 [inline] __sock_sendmsg+0x14b/0x180 net/socket.c:730 __sys_sendto+0x320/0x3b0 net/socket.c:2152 __do_sys_sendto net/socket.c:2164 [inline] __se_sys_sendto net/socket.c:2160 [inline] __x64_sys_sendto+0xdc/0x1b0 net/socket.c:2160 do_syscall_x64 arch/x86/entry/common.c:46 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Freed by task 938: kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1729 [inline] slab_free_freelist_hook mm/slub.c:1755 [inline] slab_free mm/slub.c:3687 [inline] __kmem_cache_free+0xbc/0x320 mm/slub.c:3700 device_release+0xa0/0x240 drivers/base/core.c:2507 kobject_cleanup lib/kobject.c:681 [inline] kobject_release lib/kobject.c:712 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x1cd/0x350 lib/kobject.c:729 put_device+0x1b/0x30 drivers/base/core.c:3805 ptp_clock_unregister+0x171/0x270 drivers/ptp/ptp_clock.c:391 gem_ptp_remove+0x4e/0x1f0 drivers/net/ethernet/cadence/macb_ptp.c:404 macb_close+0x1c8/0x270 drivers/net/ethernet/cadence/macb_main.c:2966 __dev_close_many+0x1b9/0x310 net/core/dev.c:1585 __dev_close net/core/dev.c:1597 [inline] __dev_change_flags+0x2bb/0x740 net/core/dev.c:8649 dev_change_flags+0x92/0x170 net/core/dev.c:8722 dev_ifsioc+0x151/0xe00 net/core/dev_ioctl.c:326 dev_ioctl+0x33e/0x1070 net/core/dev_ioctl.c:572 sock_do_ioctl+0x20d/0x2c0 net/socket.c:1215 sock_ioctl+0x577/0x6d0 net/socket.c:1320 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x18c/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:46 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Set the PTP clock pointer to NULL after unregistering. Fixes: c2594d804d5c ("macb: Common code to enable ptp support for MACB/GEM") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Link: https://patch.msgid.link/20260316103826.74506-1-pchelkin@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysbnxt_en: fix OOB access in DBG_BUF_PRODUCER async event handlerJunrui Luo
The ASYNC_EVENT_CMPL_EVENT_ID_DBG_BUF_PRODUCER handler in bnxt_async_event_process() uses a firmware-supplied 'type' field directly as an index into bp->bs_trace[] without bounds validation. The 'type' field is a 16-bit value extracted from DMA-mapped completion ring memory that the NIC writes directly to host RAM. A malicious or compromised NIC can supply any value from 0 to 65535, causing an out-of-bounds access into kernel heap memory. The bnxt_bs_trace_check_wrap() call then dereferences bs_trace->magic_byte and writes to bs_trace->last_offset and bs_trace->wrapped, leading to kernel memory corruption or a crash. Fix by adding a bounds check and defining BNXT_TRACE_MAX as DBG_LOG_BUFFER_FLUSH_REQ_TYPE_ERR_QPC_TRACE + 1 to cover all currently defined firmware trace types (0x0 through 0xc). Fixes: 84fcd9449fd7 ("bnxt_en: Manage the FW trace context memory") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/SYBPR01MB7881A253A1C9775D277F30E9AF42A@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 dayslibie: prevent memleak in fwlog codeMichal Swiatkowski
All cmd_buf buffers are allocated and need to be freed after usage. Add an error unwinding path that properly frees these buffers. The memory leak happens whenever fwlog configuration is changed. For example: $echo 256K > /sys/kernel/debug/ixgbe/0000\:32\:00.0/fwlog/log_size Fixes: 96a9a9341cda ("ice: configure FW logging") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 daysiavf: fix VLAN filter lost on add/delete racePetr Oros
When iavf_add_vlan() finds an existing filter in IAVF_VLAN_REMOVE state, it transitions the filter to IAVF_VLAN_ACTIVE assuming the pending delete can simply be cancelled. However, there is no guarantee that iavf_del_vlans() has not already processed the delete AQ request and removed the filter from the PF. In that case the filter remains in the driver's list as IAVF_VLAN_ACTIVE but is no longer programmed on the NIC. Since iavf_add_vlans() only picks up filters in IAVF_VLAN_ADD state, the filter is never re-added, and spoof checking drops all traffic for that VLAN. CPU0 CPU1 Workqueue ---- ---- --------- iavf_del_vlan(vlan 100) f->state = REMOVE schedule AQ_DEL_VLAN iavf_add_vlan(vlan 100) f->state = ACTIVE iavf_del_vlans() f is ACTIVE, skip iavf_add_vlans() f is ACTIVE, skip Filter is ACTIVE in driver but absent from NIC. Transition to IAVF_VLAN_ADD instead and schedule IAVF_FLAG_AQ_ADD_VLAN_FILTER so iavf_add_vlans() re-programs the filter. A duplicate add is idempotent on the PF. Fixes: 0c0da0e95105 ("iavf: refactor VLAN filter states") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 daysigc: fix page fault in XDP TX timestamps handlingZdenek Bouska
If an XDP application that requested TX timestamping is shutting down while the link of the interface in use is still up the following kernel splat is reported: [ 883.803618] [ T1554] BUG: unable to handle page fault for address: ffffcfb6200fd008 ... [ 883.803650] [ T1554] Call Trace: [ 883.803652] [ T1554] <TASK> [ 883.803654] [ T1554] igc_ptp_tx_tstamp_event+0xdf/0x160 [igc] [ 883.803660] [ T1554] igc_tsync_interrupt+0x2d5/0x300 [igc] ... During shutdown of the TX ring the xsk_meta pointers are left behind, so that the IRQ handler is trying to touch them. This issue is now being fixed by cleaning up the stale xsk meta data on TX shutdown. TX timestamps on other queues remain unaffected. Fixes: 15fd021bc427 ("igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet") Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 daysigc: fix missing update of skb->tail in igc_xmit_frame()Kohei Enju
igc_xmit_frame() misses updating skb->tail when the packet size is shorter than the minimum one. Use skb_put_padto() in alignment with other Intel Ethernet drivers. Fixes: 0507ef8a0372 ("igc: Add transmit and receive fastpath and interrupt handlers") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 daystg3: replace placeholder MAC address with device propertyPaul SAGE
On some systems (e.g. iMac 20,1 with BCM57766), the tg3 driver reads a default placeholder mac address (00:10:18:00:00:00) from the mailbox. The correct value on those systems are stored in the 'local-mac-address' property. This patch, detect the default value and tries to retrieve the correct address from the device_get_mac_address function instead. The patch has been tested on two different systems: - iMac 20,1 (BCM57766) model which use the local-mac-address property - iMac 13,2 (BCM57766) model which can use the mailbox, NVRAM or MAC control registers Tested-by: Rishon Jonathan R <mithicalaviator85@gmail.com> Co-developed-by: Vincent MORVAN <vinc@42.fr> Signed-off-by: Vincent MORVAN <vinc@42.fr> Signed-off-by: Paul SAGE <paul.sage@42.fr> Signed-off-by: Atharva Tiwari <atharvatiwarilinuxdev@gmail.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20260314215432.3589-1-atharvatiwarilinuxdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: airoha: Remove airoha_dev_stop() in airoha_remove()Lorenzo Bianconi
Do not run airoha_dev_stop routine explicitly in airoha_remove() since ndo_stop() callback is already executed by unregister_netdev() in __dev_close_many routine if necessary and, doing so, we will end up causing an underflow in the qdma users atomic counters. Rely on networking subsystem to stop the device removing the airoha_eth module. Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260313-airoha-remove-ndo_stop-remove-net-v2-1-67542c3ceeca@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet: macb: Reinitialize tx/rx queue pointer registers and rx ring during resumeKevin Hao
On certain platforms, such as AMD Versal boards, the tx/rx queue pointer registers are cleared after suspend, and the rx queue pointer register is also disabled during suspend if WOL is enabled. Previously, we assumed that these registers would be restored by macb_mac_link_up(). However, in commit bf9cf80cab81, macb_init_buffers() was moved from macb_mac_link_up() to macb_open(). Therefore, we should call macb_init_buffers() to reinitialize the tx/rx queue pointer registers during resume. Due to the reset of these two registers, we also need to adjust the tx/rx rings accordingly. The tx ring will be handled by gem_shuffle_tx_rings() in macb_mac_link_up(), so we only need to initialize the rx ring here. Fixes: bf9cf80cab81 ("net: macb: Fix tx/rx malfunction after phy link down and up") Reported-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Kevin Hao <haokexin@gmail.com> Tested-by: Quanyang Wang <quanyang.wang@windriver.com> Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260312-macb-versal-v1-2-467647173fa4@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet: macb: Introduce gem_init_rx_ring()Kevin Hao
Extract the initialization code for the GEM RX ring into a new function. This change will be utilized in a subsequent patch. No functional changes are introduced. Signed-off-by: Kevin Hao <haokexin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260312-macb-versal-v1-1-467647173fa4@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet: ti: icssg-prueth: Fix memory leak in XDP_DROP for non-zero-copy modeMeghana Malladi
Page recycling was removed from the XDP_DROP path in emac_run_xdp() to avoid conflicts with AF_XDP zero-copy mode, which uses xsk_buff_free() instead. However, this causes a memory leak when running XDP programs that drop packets in non-zero-copy mode (standard page pool mode). The pages are never returned to the page pool, leading to OOM conditions. Fix this by handling cleanup in the caller, emac_rx_packet(). When emac_run_xdp() returns ICSSG_XDP_CONSUMED for XDP_DROP, the caller now recycles the page back to the page pool. The zero-copy path, emac_rx_packet_zc() already handles cleanup correctly with xsk_buff_free(). Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX") Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260311095441.1691636-1-m-malladi@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet: mana: fix use-after-free in mana_hwc_destroy_channel() by reordering ↵Dipayaan Roy
teardown A potential race condition exists in mana_hwc_destroy_channel() where hwc->caller_ctx is freed before the HWC's Completion Queue (CQ) and Event Queue (EQ) are destroyed. This allows an in-flight CQ interrupt handler to dereference freed memory, leading to a use-after-free or NULL pointer dereference in mana_hwc_handle_resp(). mana_smc_teardown_hwc() signals the hardware to stop but does not synchronize against IRQ handlers already executing on other CPUs. The IRQ synchronization only happens in mana_hwc_destroy_cq() via mana_gd_destroy_eq() -> mana_gd_deregister_irq(). Since this runs after kfree(hwc->caller_ctx), a concurrent mana_hwc_rx_event_handler() can dereference freed caller_ctx (and rxq->msg_buf) in mana_hwc_handle_resp(). Fix this by reordering teardown to reverse-of-creation order: destroy the TX/RX work queues and CQ/EQ before freeing hwc->caller_ctx. This ensures all in-flight interrupt handlers complete before the memory they access is freed. Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/abHA3AjNtqa1nx9k@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet: bcmgenet: increase WoL poll timeoutJustin Chen
Some systems require more than 5ms to get into WoL mode. Increase the timeout value to 50ms. Fixes: c51de7f3976b ("net: bcmgenet: add Wake-on-LAN support code") Signed-off-by: Justin Chen <justin.chen@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20260312191852.3904571-1-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysocteontx2-af: devlink: fix NIX RAS reporter to use RAS interrupt statusAlok Tiwari
The NIX RAS health report path uses nix_af_rvu_err when handling the NIX_AF_RVU_RAS case, so the report prints the ERR interrupt status rather than the RAS interrupt status. Use nix_af_rvu_ras for the NIX_AF_RVU_RAS report. Fixes: 5ed66306eab6 ("octeontx2-af: Add devlink health reporters for NIX") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20260310184824.1183651-2-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysocteontx2-af: devlink: fix NIX RAS reporter recovery conditionAlok Tiwari
The NIX RAS health reporter recovery routine checks nix_af_rvu_int to decide whether to re-enable NIX_AF_RAS interrupts. This is the RVU interrupt status field and is unrelated to RAS events, so the recovery flow may incorrectly skip re-enabling NIX_AF_RAS interrupts. Check nix_af_rvu_ras instead before writing NIX_AF_RAS_ENA_W1S. Fixes: 5ed66306eab6 ("octeontx2-af: Add devlink health reporters for NIX") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20260310184824.1183651-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet: ethernet: ti: am65-cpsw-nuss: Fix rx_filter value for PTP supportChintan Vankar
The "rx_filter" member of "hwtstamp_config" structure is an enum field and does not support bitwise OR combination of multiple filter values. It causes error while linuxptp application tries to match rx filter version. Fix this by storing the requested filter type in a new port field. Fixes: 97248adb5a3b ("net: ti: am65-cpsw: Update hw timestamping filter for PTPv1 RX packets") Signed-off-by: Chintan Vankar <c-vankar@ti.com> Link: https://patch.msgid.link/20260310160940.109822-1-c-vankar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet/mana: Null service_wq on setup error to prevent double destroyShiraz Saleem
In mana_gd_setup() error path, set gc->service_wq to NULL after destroy_workqueue() to match the cleanup in mana_gd_cleanup(). This prevents a use-after-free if the workqueue pointer is checked after a failed setup. Fixes: f975a0955276 ("net: mana: Fix double destroy_workqueue on service rescan PCI path") Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260309172443.688392-1-kotaranov@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysMerge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2026-03-10 (ice, iavf, i40e, e1000e, e1000) Nikolay Aleksandrov changes return code of RDMA related ice devlink get parameters when irdma is not enabled to -EOPNOTSUPP as current return of -ENODEV causes issues with devlink output. Petr Oros resolves a couple of issues in iavf; freeing PTP resources before reset and disable. Fixing contention issues with the netdev lock between reset and some ethtool operations. Alok Tiwari corrects an incorrect comparison of cloud filter values and adjust some passed arguments to sizeof() for consistency on i40e. Matt Vollrath removes an incorrect decrement for DMA error on e1000 and e1000e drivers. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: e1000/e1000e: Fix leak in DMA error cleanup i40e: fix src IP mask checks and memcpy argument names in cloud filter iavf: fix incorrect reset handling in callbacks iavf: fix PTP use-after-free during reset drivers: net: ice: fix devlink parameters get without irdma ==================== Link: https://patch.msgid.link/20260310205654.4109072-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: bcmgenet: fix broken EEE by converting to phylib-managed stateNicolai Buchwitz
The bcmgenet EEE implementation is broken in several ways. phy_support_eee() is never called, so the PHY never advertises EEE and phylib never sets phydev->enable_tx_lpi. bcmgenet_mac_config() checks priv->eee.eee_enabled to decide whether to enable the MAC LPI logic, but that field is never initialised to true, so the MAC never enters Low Power Idle even when EEE is negotiated - wasting the power savings EEE is designed to provide. The only way to get EEE working at all is a manual 'ethtool --set-eee eth0 eee on' after every link-up, and even then bcmgenet_get_eee() immediately clobbers the reported state because phy_ethtool_get_eee() overwrites eee_enabled and tx_lpi_enabled with the uninitialised PHY eee_cfg values. Finally, bcmgenet_mac_config() is only called on link-up, so EEE is never disabled in hardware on link-down. Fix all of this by removing the MAC-side EEE state tracking (priv->eee) and aligning with the pattern used by other non-phylink MAC drivers such as FEC. Call phy_support_eee() in bcmgenet_mii_probe() so the PHY advertises EEE link modes and phylib tracks negotiation state. Move the EEE hardware control to bcmgenet_mii_setup(), which is called on every link event, and drive it directly from phydev->enable_tx_lpi - the flag phylib sets when EEE is negotiated and the user has not disabled it. This enables EEE automatically once the link partner agrees and disables it cleanly on link-down. Make bcmgenet_get_eee() and bcmgenet_set_eee() pure passthroughs to phy_ethtool_get_eee() and phy_ethtool_set_eee(), with the MAC hardware register read/written for tx_lpi_timer. Drop struct ethtool_keee eee from struct bcmgenet_priv. Fixes: fe0d4fd9285e ("net: phy: Keep track of EEE configuration") Link: https://lore.kernel.org/netdev/d352039f-4cbb-41e6-9aeb-0b4f3941b54c@lunn.ch/ Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20260310054935.1238594-1-nb@tipi-net.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: ethernet: arc: emac: quiesce interrupts before requesting IRQFan Wu
Normal RX/TX interrupts are enabled later, in arc_emac_open(), so probe should not see interrupt delivery in the usual case. However, hardware may still present stale or latched interrupt status left by firmware or the bootloader. If probe later unwinds after devm_request_irq() has installed the handler, such a stale interrupt can still reach arc_emac_intr() during teardown and race with release of the associated net_device. Avoid that window by putting the device into a known quiescent state before requesting the IRQ: disable all EMAC interrupt sources and clear any pending EMAC interrupt status bits. This keeps the change hardware-focused and minimal, while preventing spurious IRQ delivery from leftover state. Fixes: e4f2379db6c6 ("ethernet/arc/arc_emac - Add new driver") Cc: stable@vger.kernel.org Signed-off-by: Fan Wu <fanwu01@zju.edu.cn> Link: https://patch.msgid.link/20260309132409.584966-1-fanwu01@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysnet: macb: Shuffle the tx ring before enabling txKevin Hao
Quanyang observed that when using an NFS rootfs on an AMD ZynqMp board, the rootfs may take an extended time to recover after a suspend. Upon investigation, it was determined that the issue originates from a problem in the macb driver. According to the Zynq UltraScale TRM [1], when transmit is disabled, the transmit buffer queue pointer resets to point to the address specified by the transmit buffer queue base address register. In the current implementation, the code merely resets `queue->tx_head` and `queue->tx_tail` to '0'. This approach presents several issues: - Packets already queued in the tx ring are silently lost, leading to memory leaks since the associated skbs cannot be released. - Concurrent write access to `queue->tx_head` and `queue->tx_tail` may occur from `macb_tx_poll()` or `macb_start_xmit()` when these values are reset to '0'. - The transmission may become stuck on a packet that has already been sent out, with its 'TX_USED' bit set, but has not yet been processed. However, due to the manipulation of 'queue->tx_head' and 'queue->tx_tail', `macb_tx_poll()` incorrectly assumes there are no packets to handle because `queue->tx_head == queue->tx_tail`. This issue is only resolved when a new packet is placed at this position. This is the root cause of the prolonged recovery time observed for the NFS root filesystem. To resolve this issue, shuffle the tx ring and tx skb array so that the first unsent packet is positioned at the start of the tx ring. Additionally, ensure that updates to `queue->tx_head` and `queue->tx_tail` are properly protected with the appropriate lock. [1] https://docs.amd.com/v/u/en-US/ug1085-zynq-ultrascale-trm Fixes: bf9cf80cab81 ("net: macb: Fix tx/rx malfunction after phy link down and up") Reported-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Kevin Hao <haokexin@gmail.com> Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260307-zynqmp-v2-1-6ef98a70e1d0@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 dayse1000/e1000e: Fix leak in DMA error cleanupMatt Vollrath
If an error is encountered while mapping TX buffers, the driver should unmap any buffers already mapped for that skb. Because count is incremented after a successful mapping, it will always match the correct number of unmappings needed when dma_error is reached. Decrementing count before the while loop in dma_error causes an off-by-one error. If any mapping was successful before an unsuccessful mapping, exactly one DMA mapping would leak. In these commits, a faulty while condition caused an infinite loop in dma_error: Commit 03b1320dfcee ("e1000e: remove use of skb_dma_map from e1000e driver") Commit 602c0554d7b0 ("e1000: remove use of skb_dma_map from e1000 driver") Commit c1fa347f20f1 ("e1000/e1000e/igb/igbvf/ixgb/ixgbe: Fix tests of unsigned in *_tx_map()") fixed the infinite loop, but introduced the off-by-one error. This issue may still exist in the igbvf driver, but I did not address it in this patch. Fixes: c1fa347f20f1 ("e1000/e1000e/igb/igbvf/ixgb/ixgbe: Fix tests of unsigned in *_tx_map()") Assisted-by: Claude:claude-4.6-opus Signed-off-by: Matt Vollrath <tactii@gmail.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
14 daysi40e: fix src IP mask checks and memcpy argument names in cloud filterAlok Tiwari
Fix following issues in the IPv4 and IPv6 cloud filter handling logic in both the add and delete paths: - The source-IP mask check incorrectly compares mask.src_ip[0] against tcf.dst_ip[0]. Update it to compare against tcf.src_ip[0]. This likely goes unnoticed because the check is in an "else if" path that only executes when dst_ip is not set, most cloud filter use cases focus on destination-IP matching, and the buggy condition can accidentally evaluate true in some cases. - memcpy() for the IPv4 source address incorrectly uses ARRAY_SIZE(tcf.dst_ip) instead of ARRAY_SIZE(tcf.src_ip), although both arrays are the same size. - The IPv4 memcpy operations used ARRAY_SIZE(tcf.dst_ip) and ARRAY_SIZE (tcf.src_ip), Update these to use sizeof(cfilter->ip.v4.dst_ip) and sizeof(cfilter->ip.v4.src_ip) to ensure correct and explicit copy size. - In the IPv6 delete path, memcmp() uses sizeof(src_ip6) when comparing dst_ip6 fields. Replace this with sizeof(dst_ip6) to make the intent explicit, even though both fields are struct in6_addr. Fixes: e284fc280473 ("i40e: Add and delete cloud filter") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
14 daysiavf: fix incorrect reset handling in callbacksPetr Oros
Three driver callbacks schedule a reset and wait for its completion: ndo_change_mtu(), ethtool set_ringparam(), and ethtool set_channels(). Waiting for reset in ndo_change_mtu() and set_ringparam() was added by commit c2ed2403f12c ("iavf: Wait for reset in callbacks which trigger it") to fix a race condition where adding an interface to bonding immediately after MTU or ring parameter change failed because the interface was still in __RESETTING state. The same commit also added waiting in iavf_set_priv_flags(), which was later removed by commit 53844673d555 ("iavf: kill "legacy-rx" for good"). Waiting in set_channels() was introduced earlier by commit 4e5e6b5d9d13 ("iavf: Fix return of set the new channel count") to ensure the PF has enough time to complete the VF reset when changing channel count, and to return correct error codes to userspace. Commit ef490bbb2267 ("iavf: Add net_shaper_ops support") added net_shaper_ops to iavf, which required reset_task to use _locked NAPI variants (napi_enable_locked, napi_disable_locked) that need the netdev instance lock. Later, commit 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations") and commit 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") started holding the netdev instance lock during ndo and ethtool callbacks for drivers with net_shaper_ops. Finally, commit 120f28a6f314 ("iavf: get rid of the crit lock") replaced the driver's crit_lock with netdev_lock in reset_task, causing incorrect behavior: the callback holds netdev_lock and waits for reset_task, but reset_task needs the same lock: Thread 1 (callback) Thread 2 (reset_task) ------------------- --------------------- netdev_lock() [blocked on workqueue] ndo_change_mtu() or ethtool op iavf_schedule_reset() iavf_wait_for_reset() iavf_reset_task() waiting... netdev_lock() <- blocked This does not strictly deadlock because iavf_wait_for_reset() uses wait_event_interruptible_timeout() with a 5-second timeout. The wait eventually times out, the callback returns an error to userspace, and after the lock is released reset_task completes the reset. This leads to incorrect behavior: userspace sees an error even though the configuration change silently takes effect after the timeout. Fix this by extracting the reset logic from iavf_reset_task() into a new iavf_reset_step() function that expects netdev_lock to be already held. The three callbacks now call iavf_reset_step() directly instead of scheduling the work and waiting, performing the reset synchronously in the caller's context which already holds netdev_lock. This eliminates both the incorrect error reporting and the need for iavf_wait_for_reset(), which is removed along with the now-unused reset_waitqueue. The workqueue-based iavf_reset_task() becomes a thin wrapper that acquires netdev_lock and calls iavf_reset_step(), preserving its use for PF-initiated resets. The callbacks may block for several seconds while iavf_reset_step() polls hardware registers, but this is acceptable since netdev_lock is a per-device mutex and only serializes operations on the same interface. v3: - Remove netif_running() guard from iavf_set_channels(). Unlike set_ringparam where descriptor counts are picked up by iavf_open() directly, num_req_queues is only consumed during iavf_reinit_interrupt_scheme() in the reset path. Skipping the reset on a down device would silently discard the channel count change. - Remove dead reset_waitqueue code (struct field, init, and all wake_up calls) since iavf_wait_for_reset() was the only consumer. Fixes: 120f28a6f314 ("iavf: get rid of the crit lock") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
14 daysiavf: fix PTP use-after-free during resetPetr Oros
Commit 7c01dbfc8a1c5f ("iavf: periodically cache PHC time") introduced a worker to cache PHC time, but failed to stop it during reset or disable. This creates a race condition where `iavf_reset_task()` or `iavf_disable_vf()` free adapter resources (AQ) while the worker is still running. If the worker triggers `iavf_queue_ptp_cmd()` during teardown, it accesses freed memory/locks, leading to a crash. Fix this by calling `iavf_ptp_release()` before tearing down the adapter. This ensures `ptp_clock_unregister()` synchronously cancels the worker and cleans up the chardev before the backing resources are destroyed. Fixes: 7c01dbfc8a1c5f ("iavf: periodically cache PHC time") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
14 daysdrivers: net: ice: fix devlink parameters get without irdmaNikolay Aleksandrov
If CONFIG_IRDMA isn't enabled but there are ice NICs in the system, the driver will prevent full devlink dev param show dump because its rdma get callbacks return ENODEV and stop the dump. For example: $ devlink dev param show pci/0000:82:00.0: name msix_vec_per_pf_max type generic values: cmode driverinit value 2 name msix_vec_per_pf_min type generic values: cmode driverinit value 2 kernel answers: No such device Returning EOPNOTSUPP allows the dump to continue so we can see all devices' devlink parameters. Fixes: c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-10amd-xgbe: reset PHY settings before starting PHYRaju Rangoju
commit f93505f35745 ("amd-xgbe: let the MAC manage PHY PM") moved xgbe_phy_reset() from xgbe_open() to xgbe_start(), placing it after phy_start(). As a result, the PHY settings were being reset after the PHY had already started. Reorder the calls so that the PHY settings are reset before phy_start() is invoked. Fixes: f93505f35745 ("amd-xgbe: let the MAC manage PHY PM") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260306111629.1515676-4-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-10amd-xgbe: prevent CRC errors during RX adaptation with AN disabledRaju Rangoju
When operating in 10GBASE-KR mode with auto-negotiation disabled and RX adaptation enabled, CRC errors can occur during the RX adaptation process. This happens because the driver continues transmitting and receiving packets while adaptation is in progress. Fix this by stopping TX/RX immediately when the link goes down and RX adaptation needs to be re-triggered, and only re-enabling TX/RX after adaptation completes and the link is confirmed up. Introduce a flag to track whether TX/RX was disabled for adaptation so it can be restored correctly. This prevents packets from being transmitted or received during the RX adaptation window and avoids CRC errors from corrupted frames. The flag tracking the data path state is synchronized with hardware state in xgbe_start() to prevent stale state after device restarts. This ensures that after a restart cycle (where xgbe_stop disables TX/RX and xgbe_start re-enables them), the flag correctly reflects that the data path is active. Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260306111629.1515676-3-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-10amd-xgbe: fix link status handling in xgbe_rx_adaptationRaju Rangoju
The link status bit is latched low to allow detection of momentary link drops. If the status indicates that the link is already down, read it again to obtain the current state. Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260306111629.1515676-2-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-10net: enetc: do not skip setting LaBCR[MDIO_PHYAD_PRTAD] for addr 0Wei Fang
Given that some platforms may use PHY address 0 (I suppose the PHY may not treat address 0 as a broadcast address or default response address). It is possible for some boards to connect multiple PHYs to the same ENETC MAC, for example: - a PHY with a non-zero address connects to ENETC MAC through SGMII interface (selected via DTS_A) - a PHY with address 0 connects to ENETC MAC through RGMII interface (selected via DTS_B) For the case where the ENETC port MDIO is used to manage the PHY, when switching from DTS_A to DTS_B via soft reboot, LaBCR[MDIO_PHYAD_PRTAD] must be updated to 0 because the NETCMIX block is not reset during soft reboot. However, the current driver explicitly skips configuring address 0, causing LaBCR[MDIO_PHYAD_PRTAD] to retain its old value. Therefore, remove the special-case skip of PHY address 0 so that valid configurations using address 0 are properly supported. Fixes: 6633df05f3ad ("net: enetc: set the external PHY address in IERB for port MDIO usage") Fixes: 50bfd9c06f0f ("net: enetc: set external PHY address in IERB for i.MX94 ENETC") Reviewed-by: Clark Wang <xiaoning.wang@nxp.com> Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260305031211.904812-3-wei.fang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-10net: enetc: fix incorrect fallback PHY address handlingWei Fang
The current netc_get_phy_addr() implementation falls back to PHY address 0 when the "mdio" node or the PHY child node is missing. On i.MX95, this causes failures when a real PHY is actually assigned address 0 and is managed through the EMDIO interface. Because the bit 0 of phy_mask will be set, leading imx95_enetc_mdio_phyaddr_config() to return an error, and the netc_blk_ctrl driver probe subsequently fails. Fix this by returning -ENODEV when neither an "mdio" node nor any PHY node is present, it means that ENETC port MDIO is not used to manage the PHY, so there is no need to configure LaBCR[MDIO_PHYAD_PRTAD]. Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com> Closes: https://lore.kernel.org/all/7825188.GXAFRqVoOG@steina-w Fixes: 6633df05f3ad ("net: enetc: set the external PHY address in IERB for port MDIO usage") Reviewed-by: Clark Wang <xiaoning.wang@nxp.com> Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260305031211.904812-2-wei.fang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-09bnxt_en: Fix RSS table size check when changing ethtool channelsPavan Chebbi
When changing channels, the current check in bnxt_set_channels() is not checking for non-default RSS contexts when the RSS table size changes. The current check for IFF_RXFH_CONFIGURED is only sufficient for the default RSS context. Expand the check to include the presence of any non-default RSS contexts. Allowing such change will result in incorrect configuration of the context's RSS table when the table size changes. Fixes: b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()") Reported-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/netdev/20260303181535.2671734-1-bjorn@kernel.org/ Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20260306225854.3575672-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net: spacemit: Fix error handling in emac_tx_mem_map()Vivian Wang
The DMA mappings were leaked on mapping error. Free them with the existing emac_free_tx_buf() function. Fixes: bfec6d7f2001 ("net: spacemit: Add K1 Ethernet MAC") Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20260305-k1-ethernet-more-fixes-v2-2-e4e434d65055@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net: spacemit: Fix error handling in emac_alloc_rx_desc_buffers()Vivian Wang
Even if we get a dma_mapping_error() while mapping an RX buffer, we should still update rx_ring->head to ensure that the buffers we were able to allocate and map are used. Fix this by breaking out to the existing code after the loop, analogous to the existing handling for skb allocation failure. Fixes: bfec6d7f2001 ("net: spacemit: Add K1 Ethernet MAC") Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20260305-k1-ethernet-more-fixes-v2-1-e4e434d65055@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5e: RX, Fix XDP multi-buf frag counting for legacy RQDragos Tatulea
XDP multi-buf programs can modify the layout of the XDP buffer when the program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The referenced commit in the fixes tag corrected the assumption in the mlx5 driver that the XDP buffer layout doesn't change during a program execution. However, this fix introduced another issue: the dropped fragments still need to be counted on the driver side to avoid page fragment reference counting issues. Such issue can be observed with the test_xdp_native_adjst_tail_shrnk_data selftest when using a payload of 3600 and shrinking by 256 bytes (an upcoming selftest patch): the last fragment gets released by the XDP code but doesn't get tracked by the driver. This results in a negative pp_ref_count during page release and the following splat: WARNING: include/net/page_pool/helpers.h:297 at mlx5e_page_release_fragmented.isra.0+0x4a/0x50 [mlx5_core], CPU#12: ip/3137 Modules linked in: [...] CPU: 12 UID: 0 PID: 3137 Comm: ip Not tainted 6.19.0-rc3+ #12 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x4a/0x50 [mlx5_core] [...] Call Trace: <TASK> mlx5e_dealloc_rx_wqe+0xcb/0x1a0 [mlx5_core] mlx5e_free_rx_descs+0x7f/0x110 [mlx5_core] mlx5e_close_rq+0x50/0x60 [mlx5_core] mlx5e_close_queues+0x36/0x2c0 [mlx5_core] mlx5e_close_channel+0x1c/0x50 [mlx5_core] mlx5e_close_channels+0x45/0x80 [mlx5_core] mlx5e_safe_switch_params+0x1a5/0x230 [mlx5_core] mlx5e_change_mtu+0xf3/0x2f0 [mlx5_core] netif_set_mtu_ext+0xf1/0x230 do_setlink.isra.0+0x219/0x1180 rtnl_newlink+0x79f/0xb60 rtnetlink_rcv_msg+0x213/0x3a0 netlink_rcv_skb+0x48/0xf0 netlink_unicast+0x24a/0x350 netlink_sendmsg+0x1ee/0x410 __sock_sendmsg+0x38/0x60 ____sys_sendmsg+0x232/0x280 ___sys_sendmsg+0x78/0xb0 __sys_sendmsg+0x5f/0xb0 [...] do_syscall_64+0x57/0xc50 This patch fixes the issue by doing page frag counting on all the original XDP buffer fragments for all relevant XDP actions (XDP_TX , XDP_REDIRECT and XDP_PASS). This is basically reverting to the original counting before the commit in the fixes tag. As frag_page is still pointing to the original tail, the nr_frags parameter to xdp_update_skb_frags_info() needs to be calculated in a different way to reflect the new nr_frags. Fixes: afd5ba577c10 ("net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for legacy RQ") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://patch.msgid.link/20260305142634.1813208-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQDragos Tatulea
XDP multi-buf programs can modify the layout of the XDP buffer when the program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The referenced commit in the fixes tag corrected the assumption in the mlx5 driver that the XDP buffer layout doesn't change during a program execution. However, this fix introduced another issue: the dropped fragments still need to be counted on the driver side to avoid page fragment reference counting issues. The issue was discovered by the drivers/net/xdp.py selftest, more specifically the test_xdp_native_tx_mb: - The mlx5 driver allocates a page_pool page and initializes it with a frag counter of 64 (pp_ref_count=64) and the internal frag counter to 0. - The test sends one packet with no payload. - On RX (mlx5e_skb_from_cqe_mpwrq_nonlinear()), mlx5 configures the XDP buffer with the packet data starting in the first fragment which is the page mentioned above. - The XDP program runs and calls bpf_xdp_pull_data() which moves the header into the linear part of the XDP buffer. As the packet doesn't contain more data, the program drops the tail fragment since it no longer contains any payload (pp_ref_count=63). - mlx5 device skips counting this fragment. Internal frag counter remains 0. - mlx5 releases all 64 fragments of the page but page pp_ref_count is 63 => negative reference counting error. Resulting splat during the test: WARNING: CPU: 0 PID: 188225 at ./include/net/page_pool/helpers.h:297 mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core] Modules linked in: [...] CPU: 0 UID: 0 PID: 188225 Comm: ip Not tainted 6.18.0-rc7_for_upstream_min_debug_2025_12_08_11_44 #1 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core] [...] Call Trace: <TASK> mlx5e_free_rx_mpwqe+0x20a/0x250 [mlx5_core] mlx5e_dealloc_rx_mpwqe+0x37/0xb0 [mlx5_core] mlx5e_free_rx_descs+0x11a/0x170 [mlx5_core] mlx5e_close_rq+0x78/0xa0 [mlx5_core] mlx5e_close_queues+0x46/0x2a0 [mlx5_core] mlx5e_close_channel+0x24/0x90 [mlx5_core] mlx5e_close_channels+0x5d/0xf0 [mlx5_core] mlx5e_safe_switch_params+0x2ec/0x380 [mlx5_core] mlx5e_change_mtu+0x11d/0x490 [mlx5_core] mlx5e_change_nic_mtu+0x19/0x30 [mlx5_core] netif_set_mtu_ext+0xfc/0x240 do_setlink.isra.0+0x226/0x1100 rtnl_newlink+0x7a9/0xba0 rtnetlink_rcv_msg+0x220/0x3c0 netlink_rcv_skb+0x4b/0xf0 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 ____sys_sendmsg+0x1e8/0x240 ___sys_sendmsg+0x7c/0xb0 [...] __sys_sendmsg+0x5f/0xb0 do_syscall_64+0x55/0xc70 The problem applies for XDP_PASS as well which is handled in a different code path in the driver. This patch fixes the issue by doing page frag counting on all the original XDP buffer fragments for all relevant XDP actions (XDP_TX , XDP_REDIRECT and XDP_PASS). This is basically reverting to the original counting before the commit in the fixes tag. As frag_page is still pointing to the original tail, the nr_frags parameter to xdp_update_skb_frags_info() needs to be calculated in a different way to reflect the new nr_frags. Fixes: 87bcef158ac1 ("net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for striding RQ") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Cc: Amery Hung <ameryhung@gmail.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260305142634.1813208-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5e: Fix DMA FIFO desync on error CQE SQ recoveryGal Pressman
In case of a TX error CQE, a recovery flow is triggered, mlx5e_reset_txqsq_cc_pc() resets dma_fifo_cc to 0 but not dma_fifo_pc, desyncing the DMA FIFO producer and consumer. After recovery, the producer pushes new DMA entries at the old dma_fifo_pc, while the consumer reads from position 0. This causes us to unmap stale DMA addresses from before the recovery. The DMA FIFO is a purely software construct with no HW counterpart. At the point of reset, all WQEs have been flushed so dma_fifo_cc is already equal to dma_fifo_pc. There is no need to reset either counter, similar to how skb_fifo pc/cc are untouched. Remove the 'dma_fifo_cc = 0' reset. This fixes the following WARNING: WARNING: CPU: 0 PID: 0 at drivers/iommu/dma-iommu.c:1240 iommu_dma_unmap_page+0x79/0x90 Modules linked in: mlx5_vdpa vringh vdpa bonding mlx5_ib mlx5_vfio_pci ipip mlx5_fwctl tunnel4 mlx5_core ib_ipoib geneve ip6_gre ip_gre gre nf_tables ip6_tunnel rdma_ucm ib_uverbs ib_umad vfio_pci vfio_pci_core act_mirred act_skbedit act_vlan vhost_net vhost tap ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress vhost_iotlb iptable_raw tunnel6 vfio_iommu_type1 vfio openvswitch nsh rpcsec_gss_krb5 auth_rpcgss oid_registry xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter overlay zram zsmalloc rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core fuse [last unloaded: nf_tables] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc5_for_upstream_min_debug_2024_12_30_21_33 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:iommu_dma_unmap_page+0x79/0x90 Code: 2b 4d 3b 21 72 26 4d 3b 61 08 73 20 49 89 d8 44 89 f9 5b 4c 89 f2 4c 89 e6 48 89 ef 5d 41 5c 41 5d 41 5e 41 5f e9 c7 ae 9e ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 2e 0f 1f 84 00 00 00 00 Call Trace: <IRQ> ? __warn+0x7d/0x110 ? iommu_dma_unmap_page+0x79/0x90 ? report_bug+0x16d/0x180 ? handle_bug+0x4f/0x90 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 ? iommu_dma_unmap_page+0x79/0x90 ? iommu_dma_unmap_page+0x2e/0x90 dma_unmap_page_attrs+0x10d/0x1b0 mlx5e_tx_wi_dma_unmap+0xbe/0x120 [mlx5_core] mlx5e_poll_tx_cq+0x16d/0x690 [mlx5_core] mlx5e_napi_poll+0x8b/0xac0 [mlx5_core] __napi_poll+0x24/0x190 net_rx_action+0x32a/0x3b0 ? mlx5_eq_comp_int+0x7e/0x270 [mlx5_core] ? notifier_call_chain+0x35/0xa0 handle_softirqs+0xc9/0x270 irq_exit_rcu+0x71/0xd0 common_interrupt+0x7f/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 Fixes: db75373c91b0 ("net/mlx5e: Recover Send Queue (SQ) from error state") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260305142634.1813208-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5: Fix peer miss rules host disabled checksCarolina Jubran
The check on mlx5_esw_host_functions_enabled(esw->dev) for adding VF peer miss rules is incorrect. These rules match traffic from peer's VFs, so the local device's host function status is irrelevant. Remove this check to ensure peer VF traffic is properly handled regardless of local host configuration. Also fix the PF peer miss rule deletion to be symmetric with the add path, so only attempt to delete the rule if it was actually created. Fixes: 520369ef43a8 ("net/mlx5: Support disabling host PFs") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260305142634.1813208-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5: Fix crash when moving to switchdev modePatrisious Haddad
When moving to switchdev mode when the device doesn't support IPsec, we try to clean up the IPsec resources anyway which causes the crash below, fix that by correctly checking for IPsec support before trying to clean up its resources. [27642.515799] WARNING: arch/x86/mm/fault.c:1276 at do_user_addr_fault+0x18a/0x680, CPU#4: devlink/6490 [27642.517159] Modules linked in: xt_conntrack xt_MASQUERADE ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat xt_addrtype rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_fwctl nfnetlink zram zsmalloc mlx5_ib fuse rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_core ib_core [27642.521358] CPU: 4 UID: 0 PID: 6490 Comm: devlink Not tainted 6.19.0-rc5_for_upstream_min_debug_2026_01_14_16_47 #1 NONE [27642.522923] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [27642.524528] RIP: 0010:do_user_addr_fault+0x18a/0x680 [27642.525362] Code: ff 0f 84 75 03 00 00 48 89 ee 4c 89 e7 e8 5e b9 22 00 49 89 c0 48 85 c0 0f 84 a8 02 00 00 f7 c3 60 80 00 00 74 22 31 c9 eb ae <0f> 0b 48 83 c4 10 48 89 ea 48 89 de 4c 89 f7 5b 5d 41 5c 41 5d 41 [27642.528166] RSP: 0018:ffff88810770f6b8 EFLAGS: 00010046 [27642.529038] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff88810b980f00 [27642.530158] RDX: 00000000000000a0 RSI: 0000000000000002 RDI: ffff88810770f728 [27642.531270] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000000 [27642.532383] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888103f3c4c0 [27642.533499] R13: 0000000000000000 R14: ffff88810770f728 R15: 0000000000000000 [27642.534614] FS: 00007f197c741740(0000) GS:ffff88856a94c000(0000) knlGS:0000000000000000 [27642.535915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27642.536858] CR2: 00000000000000a0 CR3: 000000011334c003 CR4: 0000000000172eb0 [27642.537982] Call Trace: [27642.538466] <TASK> [27642.538907] exc_page_fault+0x76/0x140 [27642.539583] asm_exc_page_fault+0x22/0x30 [27642.540282] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 [27642.541134] Code: 07 85 c0 75 11 ba ff 00 00 00 f0 0f b1 17 75 06 b8 01 00 00 00 c3 31 c0 c3 90 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 7e 02 00 00 48 89 d8 5b [27642.543936] RSP: 0018:ffff88810770f7d8 EFLAGS: 00010046 [27642.544803] RAX: 0000000000000000 RBX: 0000000000000202 RCX: ffff888113ad96d8 [27642.545916] RDX: 0000000000000001 RSI: ffff88810770f818 RDI: 00000000000000a0 [27642.547027] RBP: 0000000000000098 R08: 0000000000000400 R09: ffff88810b980f00 [27642.548140] R10: 0000000000000001 R11: ffff888101845a80 R12: 00000000000000a8 [27642.549263] R13: ffffffffa02a9060 R14: 00000000000000a0 R15: ffff8881130d8a40 [27642.550379] complete_all+0x20/0x90 [27642.551010] mlx5e_ipsec_disable_events+0xb6/0xf0 [mlx5_core] [27642.552022] mlx5e_nic_disable+0x12d/0x220 [mlx5_core] [27642.552929] mlx5e_detach_netdev+0x66/0xf0 [mlx5_core] [27642.553822] mlx5e_netdev_change_profile+0x5b/0x120 [mlx5_core] [27642.554821] mlx5e_vport_rep_load+0x419/0x590 [mlx5_core] [27642.555757] ? xa_load+0x53/0x90 [27642.556361] __esw_offloads_load_rep+0x54/0x70 [mlx5_core] [27642.557328] mlx5_esw_offloads_rep_load+0x45/0xd0 [mlx5_core] [27642.558320] esw_offloads_enable+0xb4b/0xc90 [mlx5_core] [27642.559247] mlx5_eswitch_enable_locked+0x34e/0x4f0 [mlx5_core] [27642.560257] ? mlx5_rescan_drivers_locked+0x222/0x2d0 [mlx5_core] [27642.561284] mlx5_devlink_eswitch_mode_set+0x5ac/0x9c0 [mlx5_core] [27642.562334] ? devlink_rate_set_ops_supported+0x21/0x3a0 [27642.563220] devlink_nl_eswitch_set_doit+0x67/0xe0 [27642.564026] genl_family_rcv_msg_doit+0xe0/0x130 [27642.564816] genl_rcv_msg+0x183/0x290 [27642.565466] ? __devlink_nl_pre_doit.isra.0+0x160/0x160 [27642.566329] ? devlink_nl_eswitch_get_doit+0x290/0x290 [27642.567181] ? devlink_nl_pre_doit_parent_dev_optional+0x20/0x20 [27642.568147] ? genl_family_rcv_msg_dumpit+0xf0/0xf0 [27642.568966] netlink_rcv_skb+0x4b/0xf0 [27642.569629] genl_rcv+0x24/0x40 [27642.570215] netlink_unicast+0x255/0x380 [27642.570901] ? __alloc_skb+0xfa/0x1e0 [27642.571560] netlink_sendmsg+0x1f3/0x420 [27642.572249] __sock_sendmsg+0x38/0x60 [27642.572911] __sys_sendto+0x119/0x180 [27642.573561] ? __sys_recvmsg+0x5c/0xb0 [27642.574227] __x64_sys_sendto+0x20/0x30 [27642.574904] do_syscall_64+0x55/0xc10 [27642.575554] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [27642.576391] RIP: 0033:0x7f197c85e807 [27642.577050] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d 45 08 0d 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 69 c3 55 48 89 e5 53 48 83 ec 38 44 89 4d d0 [27642.579846] RSP: 002b:00007ffebd4e2248 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [27642.581082] RAX: ffffffffffffffda RBX: 000055cfcd9cd2a0 RCX: 00007f197c85e807 [27642.582200] RDX: 0000000000000038 RSI: 000055cfcd9cd490 RDI: 0000000000000003 [27642.583320] RBP: 00007ffebd4e2290 R08: 00007f197c942200 R09: 000000000000000c [27642.584437] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [27642.585555] R13: 000055cfcd9cd490 R14: 00007ffebd4e45d1 R15: 000055cfcd9cd2a0 [27642.586671] </TASK> [27642.587121] ---[ end trace 0000000000000000 ]--- [27642.587910] BUG: kernel NULL pointer dereference, address: 00000000000000a0 Fixes: 664f76be38a1 ("net/mlx5: Fix IPsec cleanup over MPV device") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260305142634.1813208-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-06net/mlx5: Fix deadlock between devlink lock and esw->wqCosmin Ratiu
esw->work_queue executes esw_functions_changed_event_handler -> esw_vfs_changed_event_handler and acquires the devlink lock. .eswitch_mode_set (acquires devlink lock in devlink_nl_pre_doit) -> mlx5_devlink_eswitch_mode_set -> mlx5_eswitch_disable_locked -> mlx5_eswitch_event_handler_unregister -> flush_workqueue deadlocks when esw_vfs_changed_event_handler executes. Fix that by no longer flushing the work to avoid the deadlock, and using a generation counter to keep track of work relevance. This avoids an old handler manipulating an esw that has undergone one or more mode changes: - the counter is incremented in mlx5_eswitch_event_handler_unregister. - the counter is read and passed to the ephemeral mlx5_host_work struct. - the work handler takes the devlink lock and bails out if the current generation is different than the one it was scheduled to operate on. - mlx5_eswitch_cleanup does the final draining before destroying the wq. No longer flushing the workqueue has the side effect of maybe no longer cancelling pending vport_change_handler work items, but that's ok since those are disabled elsewhere: - mlx5_eswitch_disable_locked disables the vport eq notifier. - mlx5_esw_vport_disable disarms the HW EQ notification and marks vport->enabled under state_lock to false to prevent pending vport handler from doing anything. - mlx5_eswitch_cleanup destroys the workqueue and makes sure all events are disabled/finished. Fixes: f1bc646c9a06 ("net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260305081019.1811100-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05net: enetc: use truesize as XDP RxQ info frag_sizeLarysa Zaremba
The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA write size. Different assumptions in enetc driver configuration lead to negative tailroom. Set frag_size to the same value as frame_sz. Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-9-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05libeth, idpf: use truesize as XDP RxQ info frag_sizeLarysa Zaremba
The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead of DMA write size. Different assumptions in idpf driver configuration lead to negative tailroom. To make it worse, buffer sizes are not actually uniform in idpf when splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size is meaningless in this case. Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable growing tail for regular splitq. Fixes: ac8a861f632e ("idpf: prepare structures to support XDP") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05i40e: use xdp.frame_sz as XDP RxQ info frag_sizeLarysa Zaremba
The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead of DMA write size. Different assumptions in i40e driver configuration lead to negative tailroom. Set frag_size to the same value as frame_sz in shared pages mode, use new helper to set frag_size when AF_XDP ZC is active. Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-7-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05i40e: fix registering XDP RxQ infoLarysa Zaremba
Current way of handling XDP RxQ info in i40e has a problem, where frag_size is not updated when xsk_buff_pool is detached or when MTU is changed, this leads to growing tail always failing for multi-buffer packets. Couple XDP RxQ info registering with buffer allocations and unregistering with cleaning the ring. Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-6-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>