Age  Commit message  Author
2025-11-24  RDMA/hns: Add delayed work for bonding  (Junxian Huang)

When conditions are met, schedule a delayed work in the bond event handler to perform the bonding operation according to the bond state. In the case of a change to the slave number or link state, re-set the netdev for the bond ibdev after the modification is complete, since these two operations may not call hns_roce_set_bond_netdev() in hns_roce_init(). The delayed work is paused during driver reset or exit to avoid concurrency.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
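As a rough illustration of the pattern described above (the struct layout and function names here are hypothetical, not the driver's actual ones), the delayed work is typically wired up like this:

    #include <linux/workqueue.h>

    /* Hypothetical bond group embedding the delayed work item. */
    struct hns_roce_bond_grp {
            struct delayed_work bond_work;
            /* ... bond state, slave list, locking ... */
    };

    static void hns_roce_bond_work_fn(struct work_struct *work)
    {
            struct hns_roce_bond_grp *bond_grp =
                    container_of(to_delayed_work(work),
                                 struct hns_roce_bond_grp, bond_work);

            /* perform the bonding operation according to the bond state,
             * then re-set the netdev for the bond ibdev if needed */
    }

    /* setup:         INIT_DELAYED_WORK(&bond_grp->bond_work, hns_roce_bond_work_fn);
     * event handler: schedule_delayed_work(&bond_grp->bond_work, HZ);
     * reset/exit:    cancel_delayed_work_sync(&bond_grp->bond_work); */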
2025-11-24  RDMA/hns: Implement bonding init/uninit process  (Junxian Huang)

Implement hns_roce_slave_init() and hns_roce_slave_uninit() for device init/uninit in bonding cases. The former is used to initialize a slave ibdev (when the slave is unlinked from a bond) or a bond ibdev, while the latter does the opposite. Most of the process is the same as regular device init/uninit, with the bonding-specific steps below added. In the bond device init flow, choose one slave to re-initialize as the main_hr_dev of the bond; it will be the only device presented for multiple slaves. During registration, set an active netdev for the ibdev based on the link state of the slaves. When this main_hr_dev slave is being unlinked while the bond is still valid, choose a new slave from the rest and initialize it as the new bond device. In the uninit flow, add a bond cleanup process: restore all the other slaves and clean up bond resources. This is only for the case where the port of main_hr_dev is directly removed without unlinking it from the bond.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/hns: Add bonding cmds  (Junxian Huang)

Add three bonding commands to configure bonding settings in HW.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/hns: Add bonding event handler  (Junxian Huang)

Register a netdev notifier for two bonding events, NETDEV_CHANGEUPPER and NETDEV_CHANGELOWERSTATE. In the NETDEV_CHANGEUPPER event handler, check some rules about the HW constraints when trying to link a new slave to the master, and store some bonding information from the notifier. In the unlinking case, simply check the number of remaining slaves to decide whether the bond is still supported. In the NETDEV_CHANGELOWERSTATE event handler, not much is done; it simply sets the bond state when the bond is ready, which will be used later.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
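A minimal sketch of such a notifier (handler and variable names are illustrative; the lower netdev the event refers to is available via netdev_notifier_info_to_dev(ptr)):

    #include <linux/netdevice.h>

    static int hns_roce_bond_event(struct notifier_block *nb,
                                   unsigned long event, void *ptr)
    {
            switch (event) {
            case NETDEV_CHANGEUPPER: {
                    struct netdev_notifier_changeupper_info *info = ptr;

                    if (info->linking) {
                            /* linking: verify HW constraints for the new slave
                             * and record bonding information from the notifier */
                    } else {
                            /* unlinking: check whether the remaining slave
                             * count still supports the bond */
                    }
                    break;
            }
            case NETDEV_CHANGELOWERSTATE:
                    /* set the bond state once the bond is ready */
                    break;
            }
            return NOTIFY_DONE;
    }

    static struct notifier_block hns_roce_bond_nb = {
            .notifier_call = hns_roce_bond_event,
    };
    /* registered with register_netdevice_notifier(&hns_roce_bond_nb) */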
2025-11-24  RDMA/hns: Initialize bonding resources  (Junxian Huang)

Allocate bond_grp resources for each card when the first device in the card is registered. Block the initialization of a VF when its PF is a bonded slave, as VFs are not supported in this case due to HW constraints.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/hns: Add helpers to obtain netdev and bus_num from hr_dev  (Junxian Huang)

Add helpers to obtain netdev and bus_num from hr_dev.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Initialize the Firmware and Hardware  (Siva Reddy Kallam)

Initialize the firmware and hardware with HWRM commands.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-9-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Add basic debugfs infrastructure  (Siva Reddy Kallam)

Add basic debugfs infrastructure for the Broadcom next generation controller.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-8-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Enable Firmware channel and query device attributes  (Siva Reddy Kallam)

Enable the Firmware channel and query device attributes.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-7-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Add infrastructure for enabling Firmware channel  (Siva Reddy Kallam)

Add infrastructure for enabling the Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-6-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Allocate required memory resources for Firmware channel  (Siva Reddy Kallam)

Allocate the required memory resources for the Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-5-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Register and get the resources from bnge driver  (Siva Reddy Kallam)

Register and get the basic required resources from the bnge driver.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-4-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24  RDMA/bng_re: Add Auxiliary interface  (Siva Reddy Kallam)

Add a basic Auxiliary interface to the driver, which supports the BCM5770X NIC family.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-3-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20  bng_en: Add RoCE aux device support  (Vikas Gupta)

Add an auxiliary (aux) device to support RoCE. The base driver is responsible for creating the auxiliary device and allocating the required resources to it, which will be owned by the bnge RoCE driver in future patches.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-2-siva.kallam@broadcom.com
Reviewed-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
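The usual auxiliary-bus flow looks roughly like the following (structure and callback names are assumptions for illustration, not bnge's actual code):

    #include <linux/auxiliary_bus.h>

    static void bng_aux_dev_release(struct device *dev)
    {
            /* free the container holding the auxiliary_device */
    }

    static int bng_create_roce_aux_dev(struct auxiliary_device *adev,
                                       struct device *parent, int id)
    {
            int ret;

            adev->id = id;
            adev->name = "rdma";      /* matched by the RDMA driver's id_table */
            adev->dev.parent = parent;
            adev->dev.release = bng_aux_dev_release;

            ret = auxiliary_device_init(adev);
            if (ret)
                    return ret;

            ret = auxiliary_device_add(adev);
            if (ret)
                    auxiliary_device_uninit(adev);
            return ret;
    }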
2025-11-20  RDMA/bnxt_re: Fix wrong check for CQ coalescing support  (Kalesh AP)

The driver was not creating the debugfs hooks for the CQ coalescing parameters because of a wrong check. Fix the condition check inside bnxt_re_init_cq_coal_debugfs().

Fixes: cf2749079011 ("RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251117061306.1140588-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13  RDMA/core: Prevent soft lockup during large user memory region cleanup  (Li RongQing)

When a process exits with numerous large, pinned memory regions consisting of 4KB pages, the cleanup of the memory region through __ib_umem_release() may cause soft lockups. This is because unpin_user_page_range_dirty_lock() is called in a tight loop to unpin and release pages without yielding the CPU.

watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
Kernel panic - not syncing: softlockup: hung tasks
CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL
 asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:free_unref_page+0xff/0x190
 ? free_unref_page+0xe3/0x190
 __put_page+0x77/0xe0
 put_compound_head+0xed/0x100
 unpin_user_page_range_dirty_lock+0xb2/0x180
 __ib_umem_release+0x57/0xb0 [ib_core]
 ib_umem_release+0x3f/0xd0 [ib_core]
 mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
 ib_dereg_mr_user+0x43/0xb0 [ib_core]
 uverbs_free_mr+0x15/0x20 [ib_uverbs]
 destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
 uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
 __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
 uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
 ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
 __fput+0x9c/0x280
 ____fput+0xe/0x20
 task_work_run+0x6a/0xb0
 do_exit+0x217/0x3c0
 do_group_exit+0x3b/0xb0
 get_signal+0x150/0x900
 arch_do_signal_or_restart+0xde/0x100
 exit_to_user_mode_loop+0xc4/0x160
 exit_to_user_mode_prepare+0xa0/0xb0
 syscall_exit_to_user_mode+0x27/0x50
 do_syscall_64+0x63/0xb0

Fix the soft lockup by adding cond_resched() calls within __ib_umem_release(). Since SG entries are typically grouped in 2MB chunks on x86_64, adding cond_resched() should have minimal performance impact.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
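The shape of the fix, as a sketch (the loop form follows drivers/infiniband/core/umem.c; treat it as illustrative rather than the exact diff):

    /* __ib_umem_release(): unpin in a loop, yielding between entries */
    for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) {
            unpin_user_page_range_dirty_lock(sg_page(sg),
                            DIV_ROUND_UP(sg->length, PAGE_SIZE),
                            make_dirty);
            cond_resched();   /* avoid soft lockups on huge pinned regions */
    }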
2025-11-13  RDMA/restrack: Fix typos in the comments  (Kalesh AP)

Fix a couple of occurrences of the misspelled word "reource" in the comments with the correct spelling "resource".

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12  RDMA/irdma: Remove redundant NULL check of udata in irdma_create_user_ah()  (Tuo Li)

The variable udata cannot be NULL because irdma_create_user_ah() always receives it. Therefore, the if() check can be safely removed.

Signed-off-by: Tuo Li <islituo@gmail.com>
Link: https://patch.msgid.link/20251112120253.68945-1-islituo@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12  RDMA/cm: Correct typedef and bad line warnings  (Randy Dunlap)

In include/rdma/ib_cm.h: correct a typedef's kernel-doc notation by adding the 'typedef' keyword to it to avoid a warning. Add a leading " *" to a kernel-doc line to avoid a warning.

Warning: ib_cm.h:289 function parameter 'ib_cm_handler' not described in 'int'
Warning: ib_cm.h:289 expecting prototype for ib_cm_handler(). Prototype was for int() instead
Warning: ib_cm.h:484 bad line: connection message in case duplicates are received.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251112062908.2711007-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
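For reference, the kernel-doc form that avoids the typedef warnings looks like this (the comment text is abridged; the prototype matches include/rdma/ib_cm.h):

    /**
     * typedef ib_cm_handler - User callback invoked to process CM events.
     * @cm_id: communication identifier
     * @event: event to be processed
     */
    typedef int (*ib_cm_handler)(struct ib_cm_id *cm_id,
                                 const struct ib_cm_event *event);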
2025-11-12  Expose definition for 1600Gbps link mode  (Leon Romanovsky)

Single patch to expose a new link mode for 1600Gbps, utilizing 8 lanes at 200Gbps per lane.

Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mlx5-next:
  net/mlx5: Expose definition for 1600Gbps link mode
2025-11-12  net/mlx5: Expose definition for 1600Gbps link mode  (Tariq Toukan)

This patch exposes a new link mode for 1600Gbps, utilizing 8 lanes at 200Gbps per lane.

Co-developed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1762863888-1092798-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-10  RDMA/rtrs: server: Fix error handling in get_or_create_srv  (Ma Ke)

After device_initialize() is called, use put_device() to release the device, according to kernel device management rules. While a direct kfree() works in this case, using put_device() is more correct. Found by code review.

Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Link: https://patch.msgid.link/20251110005158.13394-1-make24@iscas.ac.cn
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
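The rule being applied, in sketch form (names are illustrative of get_or_create_srv()'s shape, not its exact code):

    device_initialize(&srv->dev);   /* refcount is live from here on */

    if (setup_fails) {
            /* after device_initialize(), drop the reference instead of
             * kfree(): the release callback frees the memory */
            put_device(&srv->dev);
            return ERR_PTR(-ENOMEM);
    }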
2025-11-09  IB/isert: add WQ_PERCPU to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133626.190952-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
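The change itself is mechanical; for an alloc_workqueue() caller it amounts to the following (queue name illustrative):

    /* before: per-CPU implicitly */
    wq = alloc_workqueue("isert_comp_wq", 0, 0);

    /* after: per-CPU behavior requested explicitly */
    wq = alloc_workqueue("isert_comp_wq", WQ_PERCPU, 0);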
2025-11-09  IB/iser: add WQ_PERCPU to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133306.187939-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  RDMA/irdma: Remove unused CQ registry  (Jacob Moroni)

The CQ registry was never actually used (ceq->reg_cq was always NULL), so remove the dead code.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20251105162841.31786-1-jmoroni@google.com
Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-11-09  RDMA/mlx5: Add other eswitch support to userspace tables  (Patrisious Haddad)

Allow the creation of RDMA TRANSPORT tables over VFs/SFs that belong to another eswitch manager, which is only possible for PFs that were connected via a create_lag PRM command.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-7-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  RDMA/mlx5: Refactor _get_prio() function  (Patrisious Haddad)

Refactor the _get_prio() function to remove redundant arguments by reusing the existing flow table attributes struct instead of passing attributes separately. This improves code clarity and maintainability. In addition, it allows a downstream patch to add a new parameter without needing to change __get_prio()'s arguments.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  RDMA/mlx5: Add other_eswitch support for devx destruction  (Patrisious Haddad)

When building a devx object destruction command for steering objects, take the other_eswitch argument into account to allow proper destruction of objects that were created with it.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-5-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  RDMA/mlx5: Change default device for LAG slaves in RDMA TRANSPORT namespaces  (Patrisious Haddad)

In case of a LAG configuration, change the root namespace core device of all the LAG slaves to be the core device of the master device for RDMA_TRANSPORT namespaces, in order to ensure all tables are created through the master device. Once the LAG is disabled, revert back to the native core device.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  Add other eswitch support  (Leon Romanovsky)

When the device is in switchdev mode, the RDMA device manages all the vports that belong to its representors, which can lead to a situation where the PF used to manage the RDMA device isn't the native PF of some of the vports it manages. Add infrastructure to allow the master PF to manage all the hardware resources for the vports under its management; currently the only such resource is RDMA TRANSPORT steering domains. This is done by adding a new FW argument, other_eswitch, which is passed by the driver to the FW to allow the master PF to properly manage vports belonging to another native PF.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  net/mlx5: fs, set non default device per namespace  (Patrisious Haddad)

Add the mlx5_fs_set_root_dev() function, which swaps the root namespace core device with another one for a given table_type. It is intended for use only by RDMA_TRANSPORT tables in LAG configurations, so that tables are always created through the LAG master device; this is valid since, during LAG, the master is allowed to manage the RDMA_TRANSPORT tables of its slaves. In addition, move the table_type enum to a global include to allow its use in a downstream patch in the RDMA driver.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-3-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  net/mlx5: fs, Add other_eswitch support for steering tables  (Patrisious Haddad)

Add other_eswitch support, which allows flow table creation above vports that reside on different esw managers. The new flag MLX5_FLOW_TABLE_OTHER_ESWITCH indicates whether the esw_owner_vhca_id attribute is supported. Note that this is only supported if the Advanced-RDMA cap rdma_transport_manager_other_eswitch is set, and it is the caller's responsibility to check that.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-2-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  net/mlx5: Add OTHER_ESWITCH HW capabilities  (Patrisious Haddad)

Add OTHER_ESWITCH capabilities, which include other_eswitch and eswitch_owner_vhca_id, to all steering objects.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-1-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  net/mlx5: Add direct ST mode support for RDMA  (Yishai Hadas)

Add support for direct ST mode, where the ST Table Location equals PCI_TPH_LOC_NONE. In that case no steering table exists, and the steering tag itself from the mkey is used directly by SW, FW, and HW. This enables RDMA users to use the currently exposed APIs to work in direct mode.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251027-st-direct-mode-v1-2-e0ad953866b6@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  PCI/TPH: Expose pcie_tph_get_st_table_loc()  (Yishai Hadas)

Expose pcie_tph_get_st_table_loc() so it can be used by drivers, as done in the next patch of the series.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251027-st-direct-mode-v1-1-e0ad953866b6@nvidia.com
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09  RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning  (Kalesh AP)

This patch adds debugfs interfaces that allow the user to enable/disable RoCE CQ coalescing and fine-tune certain CQ coalescing parameters, which is helpful during debug.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251103043425.234846-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
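A hedged sketch of what such a debugfs interface typically looks like (directory, file, and field names here are hypothetical, not the driver's actual ones):

    #include <linux/debugfs.h>

    /* expose CQ-coalescing knobs under the device's debugfs directory */
    struct dentry *dir = debugfs_create_dir("cq_coalescing", rdev_dentry);

    debugfs_create_bool("enable", 0644, dir, &coal->enable);
    debugfs_create_u32("buf_maxtime", 0644, dir, &coal->buf_maxtime);
    debugfs_create_u32("normal_maxbuf", 0644, dir, &coal->normal_maxbuf);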
2025-11-06  IB/rdmavt: rdmavt_qp.h: clean up kernel-doc comments  (Randy Dunlap)

Correct the kernel-doc comments format to avoid around 35 kernel-doc warnings:
- use the struct keyword to introduce struct kernel-doc comments
- use the correct variable name for some struct members
- use the correct function name in comments for some functions
- fix spelling in a few comments
- use a ':' instead of '-' to separate struct members from their descriptions
- add a function name heading for rvt_div_mtu()

This leaves one struct member that is not described:
rdmavt_qp.h:206: warning: Function parameter or struct member 'wq' not described in 'rvt_krwq'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251105045127.106822-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06  IB/rdmavt: WQ_PERCPU added to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06  RDMA/mlx4: WQ_PERCPU added to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

CC: Yishai Hadas <yishaih@nvidia.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-5-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06  hfi1: WQ_PERCPU added to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-4-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06  RDMA/core: WQ_PERCPU added to alloc_workqueue users  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-CPU when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06  RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wq  (Marco Crivellari)

Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue, so as not to enforce locality constraints for random work whenever that is not required. system_dfl_wq is added to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
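The replacement is one-for-one:

    /* before */
    queue_work(system_unbound_wq, &my_work);

    /* after: same unbound semantics, new default name */
    queue_work(system_dfl_wq, &my_work);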
2025-11-06  RDMA/irdma: Take a lock before moving SRQ tail in poll_cq  (Jay Bhat)

Need to take an SRQ lock in poll_cq before moving the SRQ tail.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-7-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
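In sketch form (lock and helper names are hypothetical; the point is only that the tail move is serialized against concurrent receive posters):

    unsigned long flags;

    spin_lock_irqsave(&srq->lock, flags);
    /* advance the SRQ tail past the consumed receive WQE */
    irdma_srq_move_tail(srq);     /* hypothetical helper */
    spin_unlock_irqrestore(&srq->lock, flags);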
2025-11-02  RDMA/irdma: CQ size and shadow update changes for GEN3  (Jay Bhat)

The CQ shadow area should not be updated at the end of a page (once every 64th CQ entry), except when the CQ has no more CQEs. SW must also increase the requested CQ size by 1 and make sure the CQ is not exactly one page in size. This addresses a quirk in the hardware.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-4-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02  RDMA/irdma: Silently consume unsignaled completions  (Jay Bhat)

In case we get an unsignaled error completion, silently consume the CQE by pretending the QP does not exist. Without this, bookkeeping for signaled completions does not work correctly.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-5-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02  RDMA/irdma: Initialize cqp_cmds_info to prevent resource leaks  (Jay Bhat)

Failure to initialize the info.create field to false in certain cases resulted in an incorrect status code going to rdma-core when dereg_mr failed during reset. To fix this, memset the entire cqp_request->info in the irdma_alloc_and_get_cqp_request() function, so that this initialization is not spread all over the code.

Signed-off-by: Bhat, Jay <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-2-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
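The fix centralizes the zeroing at allocation time, roughly:

    /* in irdma_alloc_and_get_cqp_request(): zero the whole info struct
     * once, so fields such as info.create start as false everywhere */
    memset(&cqp_request->info, 0, sizeof(cqp_request->info));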
2025-11-02  RDMA/irdma: Enforce local fence for LOCAL_INV WRs  (Jacob Moroni)

Enforce a local fence for LOCAL_INV WRs to avoid spurious FASTREG_VALID_MKEY async events during heavy invalidation/registration activity.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-3-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
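In the post-send path this reduces to forcing the fence bit for that opcode; a sketch, assuming a local_fence flag in irdma's per-WR SQ info struct:

    if (ib_wr->opcode == IB_WR_LOCAL_INV)
            info.local_fence = true;  /* serialize behind prior local ops */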
2025-10-28  RDMA/rxe: Fix null deref on srq->rq.queue after resize failure  (Zhu Yanjun)

A NULL pointer dereference can occur in rxe_srq_chk_attr() when ibv_modify_srq() is invoked twice in succession under certain error conditions. The first call may fail in rxe_queue_resize(), which leads rxe_srq_from_attr() to set srq->rq.queue = NULL. The second call then triggers a crash (null deref) when accessing srq->rq.queue->buf->index_mask.

Call Trace:
 <TASK>
 rxe_modify_srq+0x170/0x480 [rdma_rxe]
 ? __pfx_rxe_modify_srq+0x10/0x10 [rdma_rxe]
 ? uverbs_try_lock_object+0x4f/0xa0 [ib_uverbs]
 ? rdma_lookup_get_uobject+0x1f0/0x380 [ib_uverbs]
 ib_uverbs_modify_srq+0x204/0x290 [ib_uverbs]
 ? __pfx_ib_uverbs_modify_srq+0x10/0x10 [ib_uverbs]
 ? tryinc_node_nr_active+0xe6/0x150
 ? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
 ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x2c0/0x470 [ib_uverbs]
 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
 ? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
 ib_uverbs_run_method+0x55a/0x6e0 [ib_uverbs]
 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x54d/0x800 [ib_uverbs]
 ? __pfx_ib_uverbs_cmd_verbs+0x10/0x10 [ib_uverbs]
 ? __pfx___raw_spin_lock_irqsave+0x10/0x10
 ? __pfx_do_vfs_ioctl+0x10/0x10
 ? ioctl_has_perm.constprop.0.isra.0+0x2c7/0x4c0
 ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
 ib_uverbs_ioctl+0x13e/0x220 [ib_uverbs]
 ? __pfx_ib_uverbs_ioctl+0x10/0x10 [ib_uverbs]
 __x64_sys_ioctl+0x138/0x1c0
 do_syscall_64+0x82/0x250
 ? fdget_pos+0x58/0x4c0
 ? ksys_write+0xf3/0x1c0
 ? __pfx_ksys_write+0x10/0x10
 ? do_syscall_64+0xc8/0x250
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? fget+0x173/0x230
 ? fput+0x2a/0x80
 ? ksys_mmap_pgoff+0x224/0x4c0
 ? do_syscall_64+0xc8/0x250
 ? do_user_addr_fault+0x37b/0xfe0
 ? clear_bhb_loop+0x50/0xa0
 ? clear_bhb_loop+0x50/0xa0
 ? clear_bhb_loop+0x50/0xa0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Tested-by: Liu Yi <asatsuyu.liu@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251027215203.1321-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
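The guard needed is small; a sketch of the shape of the fix (placement and error code are assumptions, not the exact upstream diff):

    /* rxe_srq_chk_attr(): bail out if an earlier resize failure left
     * srq->rq.queue NULL before dereferencing it */
    if (!srq->rq.queue)
            return -EINVAL;

    /* only now is srq->rq.queue->buf->index_mask safe to read */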
2025-10-27  RDMA/cm: Base cm_id destruction timeout on CMA values  (Håkon Bugge)

When a GSI MAD packet is sent on the QP, it will potentially be retried CMA_MAX_CM_RETRIES times with a timeout value of:

4.096 usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT

The above equates to ~64 seconds using the default CMA values. The cm_id_priv's refcount will be incremented for this period. Therefore, the timeout value waiting for a cm_id destruction must be based on the effective timeout of MAD packets. To provide additional leeway, we add 25% to this timeout and use that instead of the constant 10 second timeout, which may result in false negatives.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
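The effective wait can be sketched in code; CMA_CM_RESPONSE_TIMEOUT (20) and CMA_MAX_CM_RETRIES (15) are the defaults from drivers/infiniband/core/cma.c, and the helper name here is illustrative:

    #include <linux/jiffies.h>

    #define CMA_CM_RESPONSE_TIMEOUT 20
    #define CMA_MAX_CM_RETRIES      15

    static unsigned long cm_destroy_wait_timeout(void)
    {
            /* one MAD attempt: 4.096 usec * 2^timeout = 4096 ns << timeout */
            u64 t_ns = 4096ULL << CMA_CM_RESPONSE_TIMEOUT;

            t_ns *= CMA_MAX_CM_RETRIES;   /* all retries: ~64 seconds */
            t_ns += t_ns / 4;             /* +25% leeway */
            return nsecs_to_jiffies(t_ns);
    }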
2025-10-24  {rdma,net}/mlx5: Query vports mac address from device  (Adithya Jayachandran)
Before this patch during either switchdev or legacy mode enablement we cleared the mac address of vports between changes. This change allows us to preserve the vports mac address between eswitch mode changes. Vports hold information for VFs/SFs such as the permanent mac address. VF/SF mac can be set either by iproute vf interface or devlink function interface. For no obvious reason we reset it to 0 on switchdev/legacy mode changes, this patch is fixing that, to align with other vport information that are never reset, e.g GUID,mtu,promisc mode, etc .. Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Leon Romanovsky <leon@kernel.org> # RDMA