linux-toradex.git - Linux kernel for Apalis and Colibri modules

Age	Commit message (Collapse)	Author
2025-12-04	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "This has another new RDMA driver 'bng_en' for latest generation Broadcom NICs. There might be one more new driver still to come. Otherwise it is a fairly quite cycle. Summary: - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re, mlx5 - Many bug fix patches for irdma - WQ_PERCPU annotations and system_dfl_wq changes - Improved mlx5 support for "other eswitches" and multiple PFs - 1600Gbps link speed reporting support. Four Digits Now! - New driver bng_en for latest generation Broadcom NICs - Bonding support for hns - Adjust mlx5's hmm based ODP to work with the very large address space created by the new 5 level paging default on x86 - Lockdep fixups in rxe and siw" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits) RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep RDMA/siw: reclassify sockets in order to avoid false positives from lockdep RDMA/bng_re: Remove prefetch instruction RDMA/core: Reduce cond_resched() frequency in __ib_umem_release RDMA/irdma: Fix SRQ shadow area address initialization RDMA/irdma: Remove doorbell elision logic RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+ RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY RDMA/irdma: Add missing mutex destroy RDMA/irdma: Fix SIGBUS in AEQ destroy RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2 RDMA/irdma: Fix data race in irdma_free_pble RDMA/irdma: Fix data race in irdma_sc_ccq_arm RDMA/mlx5: Add support for 1600_8x lane speed RDMA/core: Add new IB rate for XDR (8x) support IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled RDMA/bnxt_re: Pass correct flag for dma mr creation RDMA/bnxt_re: Fix the inline size for GenP7 devices RDMA/hns: Support reset recovery for bond RDMA/hns: Support link state reporting for bond ...
2025-11-26	RDMA/core: Reduce cond_resched() frequency in __ib_umem_release	Li RongQing
	The current implementation calls cond_resched() for every SG entry in __ib_umem_release(), which can increase needless overhead. This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit how often cond_resched() is called. The function now yields the CPU once every 4096 iterations, and yield at the very first iteration for lots of small umem case, to reduce scheduling overhead. Fixes: d056bc45b62b ("RDMA/core: Prevent soft lockup during large user memory region cleanup") Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24	RDMA/core: Add new IB rate for XDR (8x) support	Maher Sanalla
	Add the new rates as defined in the Infiniband spec for XDR and 8x link width support. Furthermore, modify the utility conversion methods accordingly. Reference: IB Spec Release 1.8 Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/20251120-speed-8-v1-1-e6a7efef8cb8@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13	RDMA/core: Prevent soft lockup during large user memory region cleanup	Li RongQing
	When a process exits with numerous large, pinned memory regions consisting of 4KB pages, the cleanup of the memory region through __ib_umem_release() may cause soft lockups. This is because unpin_user_page_range_dirty_lock() is called in a tight loop for unpin and releasing page without yielding the CPU. watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464] Kernel panic - not syncing: softlockup: hung tasks CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL asm_sysvec_apic_timer_interrupt+0x1b/0x20 RIP: 0010:free_unref_page+0xff/0x190 ? free_unref_page+0xe3/0x190 __put_page+0x77/0xe0 put_compound_head+0xed/0x100 unpin_user_page_range_dirty_lock+0xb2/0x180 __ib_umem_release+0x57/0xb0 [ib_core] ib_umem_release+0x3f/0xd0 [ib_core] mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib] ib_dereg_mr_user+0x43/0xb0 [ib_core] uverbs_free_mr+0x15/0x20 [ib_uverbs] destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs] uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs] __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs] uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs] ib_uverbs_close+0x1f/0xb0 [ib_uverbs] __fput+0x9c/0x280 ____fput+0xe/0x20 task_work_run+0x6a/0xb0 do_exit+0x217/0x3c0 do_group_exit+0x3b/0xb0 get_signal+0x150/0x900 arch_do_signal_or_restart+0xde/0x100 exit_to_user_mode_loop+0xc4/0x160 exit_to_user_mode_prepare+0xa0/0xb0 syscall_exit_to_user_mode+0x27/0x50 do_syscall_64+0x63/0xb0 Fix soft lockup issues by incorporating cond_resched() calls within __ib_umem_release(), and this SG entries are typically grouped in 2MB chunks on x86_64, adding cond_resched() should has minimal performance impact. Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com Acked-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13	RDMA/restrack: Fix typos in the comments	Kalesh AP
	Fix couple of occurrences of the misspelled word "reource" in the comments with the correct spelling "resource". Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06	RDMA/core: WQ_PERCPU added to alloc_workqueue users	Marco Crivellari
	Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06	RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wq	Marco Crivellari
	Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-27	RDMA/cm: Base cm_id destruction timeout on CMA values	Håkon Bugge
	When a GSI MAD packet is sent on the QP, it will potentially be retried CMA_MAX_CM_RETRIES times with a timeout value of: 4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT The above equates to ~64 seconds using the default CMA values. The cm_id_priv's refcount will be incremented for this period. Therefore, the timeout value waiting for a cm_id destruction must be based on the effective timeout of MAD packets. To provide additional leeway, we add 25% to this timeout and use that instead of the constant 10 seconds timeout, which may result in false negatives. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19	RDMA/uverbs: Fix umem release in UVERBS_METHOD_CQ_CREATE	Shuhao Fu
	In `UVERBS_METHOD_CQ_CREATE`, umem should be released if anything goes wrong. Currently, if `create_cq_umem` fails, umem would not be released or referenced, causing a possible leak. In this patch, we release umem at `UVERBS_METHOD_CQ_CREATE`, the driver should not release umem if it returns an error code. Fixes: 1a40c362ae26 ("RDMA/uverbs: Add a common way to create CQ with umem") Signed-off-by: Shuhao Fu <sfual@cse.ust.hk> Link: https://patch.msgid.link/aOh1le4YqtYwj-hH@osx.local Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19	RDMA/core: let rdma_connect_locked() call ↵	Stefan Metzmacher
	lockdep_assert_held(&id_priv->handler_mutex) rdma_accept() also has this, so this is now more consistent and may prevent bugs in future. Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-03	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma, and lots of small bnxt_re improvements. - Small bug fixes and improves to hfi1, efa, mlx5, erdma, rdmarvt, siw - Allow userspace access to IB service records through the rdmacm - Optimize dma mapping for erdma - Fix shutdown of the GSI QP in mana - Support relaxed ordering MR and fix a corruption bug with mlx5 DMA Data Direct - Many improvement to bnxt_re: - Debugging features and counters - Improve performance of some commands - Change flow_label reporting in completions - Mirror vnic - RDMA flow support - New RDMA driver for Pensando Ethernet devices: ionic - Gen 3 hardware support for the Intel irdma driver - Fix rdma routing resolution with VRFs" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits) RDMA/ionic: Fix memory leak of admin q_wr RDMA/siw: Always report immediate post SQ errors RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler RDMA/irdma: Remove unused struct irdma_cq fields RDMA/irdma: Fix positive vs negative error codes in irdma_post_send() RDMA/bnxt_re: Remove non-statistics counters from hw_counters RDMA/bnxt_re: Add debugfs info entry for device and resource information RDMA/bnxt_re: Fix incorrect errno used in function comments RDMA: Use %pe format specifier for error pointers RDMA/ionic: Use ether_addr_copy instead of memcpy RDMA/ionic: Fix build failure on SPARC due to xchg() operand size RDMA/rxe: Fix race in do_task() when draining IB/sa: Fix sa_local_svc_timeout_ms read race IB/ipoib: Ignore L3 master device RDMA/core: Use route entry flag to decide on loopback traffic RDMA/core: Resolve MAC of next-hop device without ARP support RDMA/core: Squash a single user static function RDMA/irdma: Update Kconfig RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices RDMA/irdma: Add Atomic Operations support ...
2025-09-21	RDMA: Use %pe format specifier for error pointers	Leon Romanovsky
	Convert error logging throughout the RDMA subsystem to use the %pe format specifier instead of PTR_ERR() with integer format specifiers. Link: https://patch.msgid.link/e81ec02df1e474be20417fb62e779776e3f47a50.1758217936.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-09-18	IB/sa: Fix sa_local_svc_timeout_ms read race	Vlad Dumitrescu
	When computing the delta, the sa_local_svc_timeout_ms is read without ib_nl_request_lock held. Though unlikely in practice, this can cause a race condition if multiple local service threads are managing the timeout. Fixes: 2ca546b92a02 ("IB/sa: Route SA pathrecord query through netlink") Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20250916163112.98414-1-edwards@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-18	RDMA/core: Use route entry flag to decide on loopback traffic	Parav Pandit
	addr_resolve() considers a destination to be local if the next-hop device of the resolved route for the destination is the loopback netdevice. This fails when the source and destination IP addresses belong to a netdev enslaved to a VRF netdev. In this case the next-hop device is the VRF itself: $ ip link add name myvrf up type vrf table 100 $ ip link set ens2f0np0 master myvrf up $ ip addr add 192.168.1.1/24 dev ens2f0np0 $ ip route get 192.168.1.1 oif myvrf local 192.168.1.1 dev myvrf table 100 src 192.168.1.1 uid 0 cache <local> This results in packets being generated with an incorrect destination MAC of the VRF netdevice and ib_write_bw failing with timeout. Solve this by determining if a destination is local or not based on the resolved route's type rather than based on its next-hop netdevice loopback flag. This enables to resolve loopback traffic with and without VRF configurations in a uniform way. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20250916111103.84069-4-edwards@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-18	RDMA/core: Resolve MAC of next-hop device without ARP support	Parav Pandit
	Currently, if the next-hop netdevice does not support ARP resolution, the destination MAC address is silently set to zero without reporting an error. This leads to incorrect behavior and may result in packet transmission failures. Fix this by deferring MAC resolution to the IP stack via neighbour lookup, allowing proper resolution or error reporting as appropriate. Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses (GIDs)") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20250916111103.84069-3-edwards@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-18	RDMA/core: Squash a single user static function	Parav Pandit
	To reduce dependencies in IFF_LOOPBACK in route and neighbour resolution steps, squash the static function to its single caller and simplify the code. Until now, network field was set even when neighbour resolution failed. With this change, dev_addr output fields are valid only when resolution is successful. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20250916111103.84069-2-edwards@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-15	RDMA/cm: Rate limit destroy CM ID timeout error message	Håkon Bugge
	When the destroy CM ID timeout kicks in, you typically get a storm of them which creates a log flooding. Hence, change pr_err() to pr_err_ratelimited() in cm_destroy_id_wait_timeout(). Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait") Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://patch.msgid.link/20250912100525.531102-1-haakon.bugge@oracle.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11	RDMA/core: fix "truely"->"truly"	Xichao Zhao
	Trivial fix to spelling mistake in comment text. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Link: https://patch.msgid.link/20250827120007.489496-1-zhao.xichao@vivo.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/core: Free pfn_list with appropriate kvfree call	Akhilesh Patil
	Ensure that pfn_list allocated by kvcalloc() is freed using corresponding kvfree() function. Match memory allocation and free routines kvcalloc -> kvfree. Fixes: 259e9bd07c57 ("RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices") Signed-off-by: Akhilesh Patil <akhilesh@ee.iitb.ac.in> Link: https://patch.msgid.link/aJjcPjL1BVh8QrMN@bhairav-test.ee.iitb.ac.in Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/ucma: Support write an event into a CM	Mark Zhang
	Enable user-space to inject an event into a CM through it's event channel. Two new events are added and supported: RDMA_CM_EVENT_USER and RDMA_CM_EVENT_INTERNAL. With these 2 events a new event parameter "arg" is supported, which is passed from sender to receiver transparently. With this feature an application is able to write an event into a CM channel with a new user-space rdmacm API. For example thread T1 could write an event with the API: rdma_write_cm_event(cm_id, RDMA_CM_EVENT_USER, status, arg); and thread T2 could receive the event with rdma_get_cm_event(). Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/fdf49d0b17a45933c5d8c1d90605c9447d9a3c73.1751279794.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/ucma: Support query resolved service records	Mark Zhang
	Enable user-space to query resolved service records through a ucma command when a RDMA_CM_EVENT_ADDRINFO_RESOLVED event is received. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/1090ee7c00c3f8058c4f9e7557de983504a16715.1751279794.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/cma: Support IB service record resolution	Mark Zhang
	Add new UCMA command and the corresponding CMA implementation. Userspace can send this command to request service resolution based on service name or ID. On a successful resolution, one or multiple service records are returned, the first one will be used as destination address by default. Two new CM events are added and returned to caller accordingly: - RDMA_CM_EVENT_ADDRINFO_RESOLVED: Resolve succeeded; - RDMA_CM_EVENT_ADDRINFO_ERROR: Resolve failed. Internally two new CM states are added: - RDMA_CM_ADDRINFO_QUERY: CM is in the process of IB service resolution; - RDMA_CM_ADDRINFO_RESOLVED: CM has finished the resolve process. With these new states, beside existing state transfer processes, 2 new processes are supported: 1. The default address is used: RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDRINFO_QUERY -> RDMA_CM_ADDRINFO_RESOLVED -> RDMA_CM_ROUTE_QUERY 2. To use a different address: RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDRINFO_QUERY-> RDMA_CM_ADDRINFO_RESOLVED -> RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_RESOLVED -> RDMA_CM_ROUTE_QUERY In the 2nd case, resolve_addrinfo returns multiple records, a user could call rdma_resolve_addr() with the one that is not the first. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/b6e82ad75522a13b5efe4ff86da0e465aab04cc2.1751279794.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/sa_query: Support IB service records resolution	Mark Zhang
	Add an SA query API ib_sa_service_rec_get() to support building and sending SA query MADs that ask for service records with a specific name or ID, and receiving and parsing responses from the SM. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/9af6c82f3a3a9d975115a33235fb4ffc7c8edb21.1751279793.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/sa_query: Add RMPP support for SA queries	Mark Zhang
	Register GSI mad agent with RMPP support and add rmpp_callback for SA queries. This is needed for querying more than one service record in one query. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/81dbcb48682e1838dc40f381cdcc0dc63f25f0f1.1751279793.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-31	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: - Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1, rxe, erdma, mana_ib - Prefetch supprot for rxe ODP - Remove memory window support from hns as new device FW is no longer support it - Remove qib, it is very old and obsolete now, Cornelis wishes to restructure the hfi1/qib shared layer - Fix a race in destroying CQs where we can still end up with work running because the work is cancled before the driver stops triggering it - Improve interaction with namespaces: * Follow the devlink namespace for newly spawned RDMA devices * Create iopoib net devces in the parent IB device's namespace * Allow CAP_NET_RAW checks to pass in user namespaces - A new flow control scheme for IB MADs to try and avoid queue overflows in the network - Fix 2G message sizes in bnxt_re - Optimize mkey layout for mlx5 DMABUF - New "DMA Handle" concept to allow controlling PCI TPH and steering tags * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits) RDMA/siw: Change maintainer email address RDMA/mana_ib: add support of multiple ports RDMA/mlx5: Refactor optional counters steering code RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr IB: Extend UVERBS_METHOD_REG_MR to get DMAH RDMA/mlx5: Add DMAH object support RDMA/core: Introduce a DMAH object and its alloc/free APIs IB/core: Add UVERBS_METHOD_REG_MR on the MR object net/mlx5: Add support for device steering tag net/mlx5: Expose IFC bits for TPH PCI/TPH: Expose pcie_tph_get_st_table_size() RDMA/mlx5: Fix incorrect MKEY masking RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey() RDMA/mlx5: remove redundant check on err on return expression RDMA/mana_ib: add additional port counters RDMA/mana_ib: Fix DSCP value in modify QP RDMA/efa: Add CQ with external memory support RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers RDMA/uverbs: Add a common way to create CQ with umem RDMA/mlx5: Optimize DMABUF mkey page size ...
2025-07-23	IB: Extend UVERBS_METHOD_REG_MR to get DMAH	Yishai Hadas
	Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers. It will be used in mlx5 driver as part of the next patch from the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23	RDMA/core: Introduce a DMAH object and its alloc/free APIs	Yishai Hadas
	Introduce a new DMA handle (DMAH) object along with its corresponding allocation and deallocation APIs. This DMAH object encapsulates attributes intended for use in DMA transactions. While its initial purpose is to support TPH functionality, it is designed to be extensible for future features such as DMA PCI multipath, PCI UIO configurations, PCI traffic class selection, and more. Further details: ---------------- We ensure that a caller requesting a DMA handle for a specific CPU ID is permitted to be scheduled on it. This prevent a potential security issue where a non privilege user may trigger DMA operations toward a CPU that it's not allowed to run on. We manage reference counting for the DMAH object and its consumers (e.g., memory regions) as will be detailed in subsequent patches in the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2cad097e849597e49d6b61e6865dba878257f371.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23	IB/core: Add UVERBS_METHOD_REG_MR on the MR object	Yishai Hadas
	This new method enables us to use a single ioctl from user space which supports the below variants of reg_mr [1]. The method will be extended in the next patches from the series with an extra attribute to let us pass DMA handle to be used as part of the registration. [1] ibv_reg_mr(), ibv_reg_mr_iova(), ibv_reg_mr_iova2(), ibv_reg_dmabuf_mr(). Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/5a3822ceef084efe967c9752e89c58d8250337c7.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13	RDMA/uverbs: Add a common way to create CQ with umem	Michael Margolin
	Add ioctl command attributes and a common handling for the option to create CQs with memory buffers passed from userspace. When required attributes are supplied, create umem and provide it for driver's use. The extension enables creation of CQs on top of preallocated CPU virtual or device memory buffers, by supplying VA or dmabuf fd, in a common way. Drivers can support this flow by initializing a new create_cq_umem fp field in their ops struct, with a function that can handle the new parameter. Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250708202308.24783-2-mrgolin@amazon.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/cm: Use separate agent w/o flow control for REP	Vlad Dumitrescu
	Most responses (e.g., RTU) are not subject to flow control, as there is no further response expected. However, REPs are both requests (waiting for RTUs) and responses (being waited by REQs). With agent-level flow control added to the MAD layer, REPs can get delayed by outstanding REQs. This can cause a problem in a scenario such as 2 hosts connecting to each other at the same time. Both hosts fill the flow control outstanding slots with REQs. The corresponding REPs are now blocked behind those REQs, and neither side can make progress until REQs time out. Add a separate MAD agent which is only used to send REPs. This agent does not have a recv_handler as it doesn't process responses nor does it register to receive requests. Disable flow control for agents w/o a recv_handler, as they aren't waiting for responses. This allows the newly added REP agent to send even when clients are slow to generate RTU, which would be needed to unblock flow control outstanding slots. Relax check in ib_post_send_mad to allow retries for this agent. REPs will be retried by the MAD layer until CM layer receives a response (e.g., RTU) on the normal agent and cancels them. Suggested-by: Sean Hefty <shefty@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/mad: Add flow control for solicited MADs	Or Har-Toov
	Currently, MADs sent via an agent are being forwarded directly to the corresponding MAD QP layer. MADs with a timeout value set and requiring a response (solicited MADs) will be resent if the timeout expires without receiving a response. In a congested subnet, flooding MAD QP layer with more solicited send requests from the agent will only worsen the situation by triggering more timeouts and therefore more retries. Thus, add flow control for non-user solicited MADs to block agents from issuing new solicited MAD requests to the MAD QP until outstanding requests are completed and the MAD QP is ready to process additional requests. While at it, keep track of the total outstanding solicited MAD work requests in send or wait list. The number of outstanding send WRs will be limited by a fraction of the RQ size, and any new send WR that exceeds that limit will be held in a backlog list. Backlog MADs will be forwarded to agent send list only once the total number of outstanding send WRs falls below the limit. Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are not subject to this flow control mechanism and will not be affected by this change. For this purpose, a new state is introduced: - 'IB_MAD_STATE_QUEUED': MAD is in backlog list Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/c0ecaa1821badee124cd13f3bf860f67ce453beb.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/mad: Add state machine to MAD layer	Or Har-Toov
	Replace the use of refcount, timeout and status with a 'state' field to track the status of MADs send work requests (WRs). The state machine better represents the stages in the MAD lifecycle, specifically indicating whether the MAD is waiting for a completion, waiting for a response, was canceld or is done. The existing refcount only takes two values: 1 : MAD is waiting either for completion or for response. 2 : MAD is waiting for both response and completion. Also when a response was received before a completion notification. The status field represents if the MAD was canceled at some point in the flow. The timeout is used to represent if a response was received. The current state transitions are not clearly visible, and developers needs to infer the state from the refcount's, timeout's or status's value, which is error-prone and difficult to follow. Thus, replace with a state machine as the following: - 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list - 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for completion notification in send list - 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for a response in wait list - 'IB_MAD_STATE_EARLY_RESP': Response came early, before send completion notification, MAD is in the send list - 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list - 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list Adding the state machine also make it possible to remove the double call for ib_mad_complete_send_wr in case of an early response and the use of a done list in case of a regular response. While at it, define a helper to clear error MADs which will handle freeing MADs that timed out or have been cancelled. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/48e6ae8689dc7bb8b4ba6e5ec562e1b018db88a8.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/counter: Check CAP_NET_RAW check in user namespace for RDMA counters	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 1bd8e0a9d0fd ("RDMA/counter: Allow manual mode configuration support") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/68e2064e72e94558a576fdbbb987681a64f6fea8.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/nldev: Check CAP_NET_RAW in user namespace for QP modify	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to modify the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0cadb4db79e1 ("RDMA/uverbs: Restrict usage of privileged QKEYs") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/099eb263622ccdd27014db7e02fec824a3307829.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/3914ef9702b01de8843a391ce397fca67d0fc7af.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 6d1e7ba241e9 ("IB/uverbs: Introduce create/destroy QP commands over ioctl") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/7b6b87505ccc28a1f7b4255af94d898d2df0fff5.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 2dee0e545894 ("IB/uverbs: Enable QP creation with a given source QP number") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/0e5920d1dfe836817bb07576b192da41b637130b.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow resource. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs") Signed-off-by: Parav Pandit <parav@nvidia.com> Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://patch.msgid.link/6df6f2f24627874c4f6d041c19dc1f6f29f68f84.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA/core: Extend RDMA device registration to be net namespace aware	Mark Bloch
	Presently, RDMA devices are always registered within the init network namespace, even if the associated devlink device's namespace was changed via a devlink reload. This mismatch leads to discrepancies between the network namespace of the devlink device and that of the RDMA device. Therefore, extend the RDMA device allocation API to optionally take the net namespace. This isn't limited to devices that support devlink but allows all users to provide the network namespace if they need to do so. If a network namespace is provided during device allocation, it's up to the caller to make sure the namespace stays valid until ib_register_device() is called. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-25	RDMA/core: reduce stack using in nldev_stat_get_doit()	Arnd Bergmann
	In the s390 defconfig, gcc-10 and earlier end up inlining three functions into nldev_stat_get_doit(), and each of them uses some 600 bytes of stack. The result is a function with an overly large stack frame and a warning: drivers/infiniband/core/nldev.c:2466:1: error: the frame size of 1720 bytes is larger than 1280 bytes [-Werror=frame-larger-than=] Mark the three functions noinline_for_stack to prevent this, ensuring that only one copy of the nlattr array is on the stack of each function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620113335.3776965-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	RDMA/core: Add driver APIs pre_destroy_cq() and post_destroy_cq()	Mark Zhang
	Currently in ib_free_cq, it disables IRQ or cancel the CQ work before driver destroy_cq. This isn't good as a new IRQ or a CQ work can be submitted immediately after disabling IRQ or canceling CQ work, which may run concurrently with destroy_cq and cause crashes. The right flow should be: 1. Driver disables CQ to make sure no new CQ event will be submitted; 2. Disables IRQ or Cancels CQ work in core layer, to make sure no CQ polling work is running; 3. Free all resources to destroy the CQ. This patch adds 2 driver APIs: - pre_destroy_cq(): Disable a CQ to prevent it from generating any new work completions, but not free any kernel resources; - post_destroy_cq(): Free all kernel resources. In ib_free_cq, the IRQ is disabled or CQ work is canceled after pre_destroy_cq, and before post_destroy_cq. Fixes: 14d3a3b2498e ("IB: add a proper completion queue abstraction") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/b5f7ae3d75f44a3e15ff3f4eb2bbdea13e06b97f.1750062328.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	IB/core: Annotate umem_mutex acquisition under fs_reclaim for lockdep	Or Har-Toov
	Following the fix in the previous commit ("IB/mlx5: Fix potential deadlock in MR deregistration"), teach lockdep explicitly about the locking order between fs_reclaim and umem_mutex. The previous commit resolved a potential deadlock scenario where kzalloc(GFP_KERNEL) was called while holding umem_mutex, which could lead to reclaim and eventually invoke the MMU notifier (mlx5_ib_invalidate_range()), causing a recursive acquisition of umem_mutex. To prevent such issues from reoccurring unnoticed in future code changes, add a lockdep annotation in ib_init_umem_odp() that simulates taking umem_mutex inside a reclaim context. This makes lockdep aware of this locking dependency and ensures that future violations—such as calling kzalloc() or any memory allocator that may enter reclaim while holding umem_mutex—will immediately raise a lockdep warning. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9d31b9d8fe1db648a9f47cec3df6b8463319dee5.1750061698.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-17	RDMA/core: Rate limit GID cache warning messages	Maor Gottlieb
	The GID cache warning messages can flood the kernel log when there are multiple failed attempts to add GIDs. This can happen when creating many virtual interfaces without having enough space for their GIDs in the GID table. Change pr_warn to pr_warn_ratelimited to prevent log flooding while still maintaining visibility of the issue. Link: https://patch.msgid.link/r/fd45ed4a1078e743f498b234c3ae816610ba1b18.1750062357.git.leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26	RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work	Jack Morgenstein
	The cited commit fixed a crash when cma_netevent_callback was called for a cma_id while work on that id from a previous call had not yet started. The work item was re-initialized in the second call, which corrupted the work item currently in the work queue. However, it left a problem when queue_work fails (because the item is still pending in the work queue from a previous call). In this case, cma_id_put (which is called in the work handler) is therefore not called. This results in a userspace process hang (zombie process). Fix this by calling cma_id_put() if queue_work fails. Fixes: 45f5dcdd0497 ("RDMA/cma: Fix workqueue crash in cma_netevent_work_handler") Link: https://patch.msgid.link/r/4f3640b501e48d0166f312a64fdadf72b059bd04.1747827103.git.leon@kernel.org Signed-off-by: Jack Morgenstein <jackm@nvidia.com> Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26	Merge tag 'v6.15' into rdma.git for-next	Jason Gunthorpe
	Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-25	IB/cm: Remove dead code and adjust naming	Vlad Dumitrescu
	Drop ib_send_cm_mra parameters which are always constant. Remove branch which is never taken. Adjust name to ib_prepare_cm_mra, which better reflects its functionality - no MRA is actually sent. Adjust name of related tracepoints. Push setting of the constant service timeout to cm.c and drop IB_CM_MRA_FLAG_DELAY. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Link: https://patch.msgid.link/cdd2a237acf2b495c19ce02e4b1c42c41c6751c2.1747827207.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25	RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices	Daisuke Matsuda
	Drivers such as rxe, which use virtual DMA, must not call into the DMA mapping core since they lack physical DMA capabilities. Otherwise, a NULL pointer dereference is observed as shown below. This patch ensures the RDMA core handles virtual and physical DMA paths appropriately. This fixes the following kernel oops: BUG: kernel NULL pointer dereference, address: 00000000000002fc #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1028eb067 P4D 1028eb067 PUD 105da0067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 3 UID: 1000 PID: 1854 Comm: python3 Tainted: G W 6.15.0-rc1+ #11 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Trigkey Key N/Key N, BIOS KEYN101 09/02/2024 RIP: 0010:hmm_dma_map_alloc+0x25/0x100 Code: 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 d6 49 c1 e6 0c 41 55 41 54 53 49 39 ce 0f 82 c6 00 00 00 49 89 fc <f6> 87 fc 02 00 00 20 0f 84 af 00 00 00 49 89 f5 48 89 d3 49 89 cf RSP: 0018:ffffd3d3420eb830 EFLAGS: 00010246 RAX: 0000000000001000 RBX: ffff8b727c7f7400 RCX: 0000000000001000 RDX: 0000000000000001 RSI: ffff8b727c7f74b0 RDI: 0000000000000000 RBP: ffffd3d3420eb858 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007262a622a000 R14: 0000000000001000 R15: ffff8b727c7f74b0 FS: 00007262a62a1080(0000) GS:ffff8b762ac3e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002fc CR3: 000000010a1f0004 CR4: 0000000000f72ef0 PKRU: 55555554 Call Trace: <TASK> ib_init_umem_odp+0xb6/0x110 [ib_uverbs] ib_umem_odp_get+0xf0/0x150 [ib_uverbs] rxe_odp_mr_init_user+0x71/0x170 [rdma_rxe] rxe_reg_user_mr+0x217/0x2e0 [rdma_rxe] ib_uverbs_reg_mr+0x19e/0x2e0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd9/0x150 [ib_uverbs] ib_uverbs_cmd_verbs+0xd19/0xee0 [ib_uverbs] ? mmap_region+0x63/0xd0 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ib_uverbs_ioctl+0xba/0x130 [ib_uverbs] __x64_sys_ioctl+0xa4/0xe0 x64_sys_call+0x1178/0x2660 do_syscall_64+0x7e/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_user_addr_fault+0x1d2/0x8d0 ? irqentry_exit_to_user_mode+0x43/0x250 ? irqentry_exit+0x43/0x50 ? exc_page_fault+0x93/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7262a6124ded Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007fffd08c3960 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fffd08c39f0 RCX: 00007262a6124ded RDX: 00007fffd08c3a10 RSI: 00000000c0181b01 RDI: 0000000000000007 RBP: 00007fffd08c39b0 R08: 0000000014107820 R09: 00007fffd08c3b44 R10: 000000000000000c R11: 0000000000000246 R12: 00007fffd08c3b44 R13: 000000000000000c R14: 00007fffd08c3b58 R15: 0000000014107960 </TASK> Fixes: 1efe8c0670d6 ("RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage") Closes: https://lore.kernel.org/all/3e8f343f-7d66-4f7a-9f08-3910623e322f@gmail.com/ Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250524144328.4361-1-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-13	RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction	Shin'ichiro Kawasaki
	The commit 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref") simplified cm_id resource management by freeing cm_id once all references to the cm_id were removed. The references are removed either upon completion of iw_cm event handlers or when the application destroys the cm_id. This commit introduced the use-after-free condition where cm_id_private object could still be in use by event handler works during the destruction of cm_id. The commit aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs") addressed this use-after- free by flushing all pending works at the cm_id destruction. However, still another use-after-free possibility remained. It happens with the work objects allocated for each cm_id_priv within alloc_work_entries() during cm_id creation, and subsequently freed in dealloc_work_entries() once all references to the cm_id are removed. If the cm_id's last reference is decremented in the event handler work, the work object for the work itself gets removed, and causes the use- after-free BUG below: BUG: KASAN: slab-use-after-free in __pwq_activate_work+0x1ff/0x250 Read of size 8 at addr ffff88811f9cf800 by task kworker/u16:1/147091 CPU: 2 UID: 0 PID: 147091 Comm: kworker/u16:1 Not tainted 6.15.0-rc2+ #27 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Workqueue: 0x0 (iw_cm_wq) Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_report+0x174/0x554 ? __virt_addr_valid+0x208/0x430 ? __pwq_activate_work+0x1ff/0x250 kasan_report+0xae/0x170 ? __pwq_activate_work+0x1ff/0x250 __pwq_activate_work+0x1ff/0x250 pwq_dec_nr_in_flight+0x8c5/0xfb0 process_one_work+0xc11/0x1460 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x16c/0x240 worker_thread+0x5ef/0xfd0 ? __pfx_worker_thread+0x10/0x10 kthread+0x3b0/0x770 ? __pfx_kthread+0x10/0x10 ? rcu_is_watching+0x11/0xb0 ? _raw_spin_unlock_irq+0x24/0x50 ? rcu_is_watching+0x11/0xb0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 147416: kasan_save_stack+0x2c/0x50 kasan_save_track+0x10/0x30 __kasan_kmalloc+0xa6/0xb0 alloc_work_entries+0xa9/0x260 [iw_cm] iw_cm_connect+0x23/0x4a0 [iw_cm] rdma_connect_locked+0xbfd/0x1920 [rdma_cm] nvme_rdma_cm_handler+0x8e5/0x1b60 [nvme_rdma] cma_cm_event_handler+0xae/0x320 [rdma_cm] cma_work_handler+0x106/0x1b0 [rdma_cm] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 Freed by task 147091: kasan_save_stack+0x2c/0x50 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x60 __kasan_slab_free+0x4b/0x70 kfree+0x13a/0x4b0 dealloc_work_entries+0x125/0x1f0 [iw_cm] iwcm_deref_id+0x6f/0xa0 [iw_cm] cm_work_handler+0x136/0x1ba0 [iw_cm] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 Last potentially related work creation: kasan_save_stack+0x2c/0x50 kasan_record_aux_stack+0xa3/0xb0 __queue_work+0x2ff/0x1390 queue_work_on+0x67/0xc0 cm_event_handler+0x46a/0x820 [iw_cm] siw_cm_upcall+0x330/0x650 [siw] siw_cm_work_handler+0x6b9/0x2b20 [siw] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 This BUG is reproducible by repeating the blktests test case nvme/061 for the rdma transport and the siw driver. To avoid the use-after-free of cm_id_private work objects, ensure that the last reference to the cm_id is decremented not in the event handler works, but in the cm_id destruction context. For that purpose, move iwcm_deref_id() call from destroy_cm_id() to the callers of destroy_cm_id(). In iw_destroy_cm_id(), call iwcm_deref_id() after flushing the pending works. During the fix work, I noticed that iw_destroy_cm_id() is called from cm_work_handler() and process_event() context. However, the comment of iw_destroy_cm_id() notes that the function "cannot be called by the event thread". Drop the false comment. Closes: https://lore.kernel.org/linux-rdma/r5676e754sv35aq7cdsqrlnvyhiq5zktteaurl7vmfih35efko@z6lay7uypy3c/ Fixes: 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref") Cc: stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://patch.msgid.link/20250510101036.1756439-1-shinichiro.kawasaki@wdc.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12	RDMA/umem: Separate implicit ODP initialization from explicit ODP	Leon Romanovsky
	Create separate functions for the implicit ODP initialization which is different from the explicit ODP initialization. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12	RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage	Leon Romanovsky
	Reuse newly added DMA API to cache IOVA and only link/unlink pages in fast path for UMEM ODP flow. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>