| Age | Commit message (Collapse) | Author |
|
Pull rdma updates from Jason Gunthorpe:
"This has another new RDMA driver 'bng_en' for latest generation
Broadcom NICs. There might be one more new driver still to come.
Otherwise it is a fairly quiet cycle. Summary:
- Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
mlx5
- Many bug fix patches for irdma
- WQ_PERCPU annotations and system_dfl_wq changes
- Improved mlx5 support for "other eswitches" and multiple PFs
- 1600Gbps link speed reporting support. Four Digits Now!
- New driver bng_en for latest generation Broadcom NICs
- Bonding support for hns
- Adjust mlx5's hmm based ODP to work with the very large address
space created by the new 5 level paging default on x86
- Lockdep fixups in rxe and siw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
RDMA/bng_re: Remove prefetch instruction
RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
RDMA/irdma: Fix SRQ shadow area address initialization
RDMA/irdma: Remove doorbell elision logic
RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
RDMA/irdma: Add missing mutex destroy
RDMA/irdma: Fix SIGBUS in AEQ destroy
RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
RDMA/irdma: Fix data race in irdma_free_pble
RDMA/irdma: Fix data race in irdma_sc_ccq_arm
RDMA/mlx5: Add support for 1600_8x lane speed
RDMA/core: Add new IB rate for XDR (8x) support
IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
RDMA/bnxt_re: Pass correct flag for dma mr creation
RDMA/bnxt_re: Fix the inline size for GenP7 devices
RDMA/hns: Support reset recovery for bond
RDMA/hns: Support link state reporting for bond
...
|
|
The current implementation calls cond_resched() for every SG entry
in __ib_umem_release(), which adds needless overhead.
This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit
how often cond_resched() is called. The function now yields the CPU
once every 4096 iterations, and also on the very first iteration to
cover the case of many small umems, reducing scheduling overhead.
Fixes: d056bc45b62b ("RDMA/core: Prevent soft lockup during large user memory region cleanup")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
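The throttling described in this commit can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: fake_cond_resched() and count_yields() are hypothetical names, and the modulo check is an assumption about how "once every 4096 iterations, plus the first one" might be expressed.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold value taken from the commit message. */
#define RESCHED_LOOP_CNT_THRESHOLD 0x1000

/* Hypothetical stand-in for cond_resched(): it only counts how often it ran. */
static size_t yield_calls;

void fake_cond_resched(void)
{
	yield_calls++;
}

/*
 * Walk nr_entries SG entries and yield on iteration 0 and on every
 * threshold-th iteration after that. Returns the number of yields.
 */
size_t count_yields(size_t nr_entries, size_t threshold)
{
	assert(threshold > 0);
	yield_calls = 0;
	for (size_t i = 0; i < nr_entries; i++) {
		/* ... the pages of entry i would be unpinned here ... */
		if (i % threshold == 0)
			fake_cond_resched();
	}
	return yield_calls;
}
```

Under these assumptions a umem with a single entry still yields once, while a umem with 10000 entries yields only on iterations 0, 4096 and 8192 instead of 10000 times.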
|
|
Add the new rates as defined in the Infiniband spec for XDR and 8x
link width support.
Furthermore, modify the utility conversion methods accordingly.
Reference: IB Spec Release 1.8
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/20251120-speed-8-v1-1-e6a7efef8cb8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
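For orientation, the arithmetic behind the 1600 Gb/s figure can be sketched as below. The helper names are hypothetical and are not the kernel's conversion functions; the sketch only assumes that XDR signals at 200 Gb/s per lane and that IB rate multipliers are expressed in units of the 2.5 Gb/s base rate.

```c
#include <assert.h>

/* XDR per-lane signalling rate in Mb/s (200 Gb/s per lane). */
#define XDR_LANE_MBPS 200000
/* IB rate multipliers are multiples of the 2.5 Gb/s base rate. */
#define IB_BASE_RATE_MBPS 2500

/* Hypothetical helper: aggregate link speed in Mb/s for a lane count. */
int xdr_link_mbps(int lanes)
{
	assert(lanes > 0);
	return lanes * XDR_LANE_MBPS;
}

/* Hypothetical helper: the same speed as a multiple of 2.5 Gb/s. */
int xdr_link_mult(int lanes)
{
	return xdr_link_mbps(lanes) / IB_BASE_RATE_MBPS;
}
```

With 8 lanes this gives 1600000 Mb/s, i.e. a multiplier of 640 over the base rate.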
|
|
When a process exits with numerous large, pinned memory regions consisting
of 4KB pages, the cleanup of the memory region through __ib_umem_release()
may cause soft lockups. This is because unpin_user_page_range_dirty_lock()
is called in a tight loop to unpin and release pages without yielding the
CPU.
watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
Kernel panic - not syncing: softlockup: hung tasks
CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL
asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:free_unref_page+0xff/0x190
? free_unref_page+0xe3/0x190
__put_page+0x77/0xe0
put_compound_head+0xed/0x100
unpin_user_page_range_dirty_lock+0xb2/0x180
__ib_umem_release+0x57/0xb0 [ib_core]
ib_umem_release+0x3f/0xd0 [ib_core]
mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
ib_dereg_mr_user+0x43/0xb0 [ib_core]
uverbs_free_mr+0x15/0x20 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
__uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
__fput+0x9c/0x280
____fput+0xe/0x20
task_work_run+0x6a/0xb0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x150/0x900
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc4/0x160
exit_to_user_mode_prepare+0xa0/0xb0
syscall_exit_to_user_mode+0x27/0x50
do_syscall_64+0x63/0xb0
Fix the soft lockup by calling cond_resched() within
__ib_umem_release(). Since SG entries are typically grouped in 2MB
chunks on x86_64, adding cond_resched() should have minimal performance
impact.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix a couple of occurrences of the misspelled word "reource"
in the comments with the correct spelling "resource".
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When a GSI MAD packet is sent on the QP, it will potentially be
retried CMA_MAX_CM_RETRIES times with a timeout value of:
4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT
The above equates to ~64 seconds using the default CMA values.
The cm_id_priv's refcount will be incremented for this period.
Therefore, the timeout value waiting for a cm_id destruction must be
based on the effective timeout of MAD packets. To provide additional
leeway, we add 25% to this timeout and use that instead of the
constant 10 seconds timeout, which may result in false negatives.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
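The timeout arithmetic in this commit can be worked through as below. This is a sketch only: the two default values are assumptions chosen because they reproduce the ~64 seconds quoted in the message, the function names are hypothetical, and the real code may count retries differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults; the commit message only states that they yield ~64 s. */
#define CMA_CM_RESPONSE_TIMEOUT 20
#define CMA_MAX_CM_RETRIES 15

/* 4.096 usec * 2^timeout_exp, expressed in nanoseconds. */
uint64_t mad_timeout_ns(unsigned int timeout_exp)
{
	assert(timeout_exp < 32);
	return 4096ULL << timeout_exp;
}

/* Effective MAD timeout over all retries, plus 25% leeway. */
uint64_t cm_destroy_wait_ns(unsigned int timeout_exp, unsigned int retries)
{
	uint64_t effective = mad_timeout_ns(timeout_exp) * retries;

	return effective + effective / 4;
}
```

With the assumed defaults, one attempt takes about 4.29 s, fifteen attempts about 64.4 s, and the padded wait comes to roughly 80.5 s, well above the previous constant of 10 seconds.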
|
|
In `UVERBS_METHOD_CQ_CREATE`, umem should be released if anything goes
wrong. Currently, if `create_cq_umem` fails, umem would not be
released or referenced, causing a possible leak.
This patch releases umem in `UVERBS_METHOD_CQ_CREATE`; the driver
should not release umem if it returns an error code.
Fixes: 1a40c362ae26 ("RDMA/uverbs: Add a common way to create CQ with umem")
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Link: https://patch.msgid.link/aOh1le4YqtYwj-hH@osx.local
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
lockdep_assert_held(&id_priv->handler_mutex)
rdma_accept() also has this, so this is now more consistent and may
prevent bugs in future.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma,
and lots of small bnxt_re improvements.
- Small bug fixes and improvements to hfi1, efa, mlx5, erdma, rdmavt,
siw
- Allow userspace access to IB service records through the rdmacm
- Optimize dma mapping for erdma
- Fix shutdown of the GSI QP in mana
- Support relaxed ordering MR and fix a corruption bug with mlx5 DMA
Data Direct
- Many improvements to bnxt_re:
- Debugging features and counters
- Improve performance of some commands
- Change flow_label reporting in completions
- Mirror vnic
- RDMA flow support
- New RDMA driver for Pensando Ethernet devices: ionic
- Gen 3 hardware support for the Intel irdma driver
- Fix rdma routing resolution with VRFs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits)
RDMA/ionic: Fix memory leak of admin q_wr
RDMA/siw: Always report immediate post SQ errors
RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler
RDMA/irdma: Remove unused struct irdma_cq fields
RDMA/irdma: Fix positive vs negative error codes in irdma_post_send()
RDMA/bnxt_re: Remove non-statistics counters from hw_counters
RDMA/bnxt_re: Add debugfs info entry for device and resource information
RDMA/bnxt_re: Fix incorrect errno used in function comments
RDMA: Use %pe format specifier for error pointers
RDMA/ionic: Use ether_addr_copy instead of memcpy
RDMA/ionic: Fix build failure on SPARC due to xchg() operand size
RDMA/rxe: Fix race in do_task() when draining
IB/sa: Fix sa_local_svc_timeout_ms read race
IB/ipoib: Ignore L3 master device
RDMA/core: Use route entry flag to decide on loopback traffic
RDMA/core: Resolve MAC of next-hop device without ARP support
RDMA/core: Squash a single user static function
RDMA/irdma: Update Kconfig
RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices
RDMA/irdma: Add Atomic Operations support
...
|
|
Convert error logging throughout the RDMA subsystem to use
the %pe format specifier instead of PTR_ERR() with integer
format specifiers.
Link: https://patch.msgid.link/e81ec02df1e474be20417fb62e779776e3f47a50.1758217936.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
When computing the delta, the sa_local_svc_timeout_ms is read without
ib_nl_request_lock held. Though unlikely in practice, this can cause
a race condition if multiple local service threads are managing the
timeout.
Fixes: 2ca546b92a02 ("IB/sa: Route SA pathrecord query through netlink")
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250916163112.98414-1-edwards@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
addr_resolve() considers a destination to be local if the next-hop
device of the resolved route for the destination is the loopback
netdevice.
This fails when the source and destination IP addresses belong to
a netdev enslaved to a VRF netdev. In this case the next-hop device
is the VRF itself:
$ ip link add name myvrf up type vrf table 100
$ ip link set ens2f0np0 master myvrf up
$ ip addr add 192.168.1.1/24 dev ens2f0np0
$ ip route get 192.168.1.1 oif myvrf
local 192.168.1.1 dev myvrf table 100 src 192.168.1.1 uid 0
cache <local>
This results in packets being generated with an incorrect destination
MAC of the VRF netdevice and ib_write_bw failing with timeout.
Solve this by determining if a destination is local or not based on
the resolved route's type rather than based on its next-hop netdevice
loopback flag.
This allows loopback traffic to be resolved in a uniform way, with and
without VRF configurations.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250916111103.84069-4-edwards@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
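The difference between the two locality checks can be illustrated with a small userspace model. Everything here is hypothetical (the struct, the enum and both predicates are simplified stand-ins, not kernel types); it only shows why a check on the route type handles the VRF case that a check on the next-hop device misses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a resolved route. */
enum route_type { ROUTE_UNICAST, ROUTE_LOCAL };

struct resolved_route {
	enum route_type type;     /* what the routing table says about the destination */
	bool nexthop_is_loopback; /* whether the next-hop netdev is the loopback device */
};

/* Old heuristic: local only if the next hop is the loopback netdevice. */
bool is_local_by_nexthop(const struct resolved_route *rt)
{
	assert(rt != NULL);
	return rt->nexthop_is_loopback;
}

/* Approach described in the commit: decide from the route type. */
bool is_local_by_route_type(const struct resolved_route *rt)
{
	assert(rt != NULL);
	return rt->type == ROUTE_LOCAL;
}
```

For the VRF example in the message, the route is local but its next hop is the VRF device, so a route built as { ROUTE_LOCAL, false } is misclassified by the first predicate and correctly classified by the second.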
|
|
Currently, if the next-hop netdevice does not support ARP resolution,
the destination MAC address is silently set to zero without reporting
an error. This leads to incorrect behavior and may result in packet
transmission failures.
Fix this by deferring MAC resolution to the IP stack via neighbour
lookup, allowing proper resolution or error reporting as appropriate.
Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses (GIDs)")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250916111103.84069-3-edwards@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
To reduce dependencies in IFF_LOOPBACK in route and neighbour resolution
steps, squash the static function to its single caller and simplify the
code.
Until now, network field was set even when neighbour resolution failed.
With this change, dev_addr output fields are valid only when resolution
is successful.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250916111103.84069-2-edwards@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When the destroy CM ID timeout kicks in, you typically get a storm of
them, which floods the log. Hence, change pr_err() to
pr_err_ratelimited() in cm_destroy_id_wait_timeout().
Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20250912100525.531102-1-haakon.bugge@oracle.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Trivial fix to spelling mistake in comment text.
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Link: https://patch.msgid.link/20250827120007.489496-1-zhao.xichao@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Ensure that pfn_list allocated by kvcalloc() is freed using corresponding
kvfree() function. Match memory allocation and free routines kvcalloc -> kvfree.
Fixes: 259e9bd07c57 ("RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices")
Signed-off-by: Akhilesh Patil <akhilesh@ee.iitb.ac.in>
Link: https://patch.msgid.link/aJjcPjL1BVh8QrMN@bhairav-test.ee.iitb.ac.in
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Enable user-space to inject an event into a CM through its event
channel. Two new events are added and supported: RDMA_CM_EVENT_USER and
RDMA_CM_EVENT_INTERNAL. With these 2 events a new event parameter "arg"
is supported, which is passed from sender to receiver transparently.
With this feature an application is able to write an event into a CM
channel with a new user-space rdmacm API. For example thread T1 could
write an event with the API:
rdma_write_cm_event(cm_id, RDMA_CM_EVENT_USER, status, arg);
and thread T2 could receive the event with rdma_get_cm_event().
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/fdf49d0b17a45933c5d8c1d90605c9447d9a3c73.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Enable user-space to query resolved service records through a ucma
command when a RDMA_CM_EVENT_ADDRINFO_RESOLVED event is received.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/1090ee7c00c3f8058c4f9e7557de983504a16715.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add new UCMA command and the corresponding CMA implementation. Userspace
can send this command to request service resolution based on service
name or ID.
On a successful resolution, one or multiple service records are
returned, the first one will be used as destination address by default.
Two new CM events are added and returned to caller accordingly:
- RDMA_CM_EVENT_ADDRINFO_RESOLVED: Resolve succeeded;
- RDMA_CM_EVENT_ADDRINFO_ERROR: Resolve failed.
Internally two new CM states are added:
- RDMA_CM_ADDRINFO_QUERY: CM is in the process of IB service
resolution;
- RDMA_CM_ADDRINFO_RESOLVED: CM has finished the resolve process.
With these new states, besides the existing state transitions, 2 new
flows are supported:
1. The default address is used:
RDMA_CM_ADDR_BOUND ->
RDMA_CM_ADDRINFO_QUERY ->
RDMA_CM_ADDRINFO_RESOLVED ->
RDMA_CM_ROUTE_QUERY
2. To use a different address:
RDMA_CM_ADDR_BOUND ->
RDMA_CM_ADDRINFO_QUERY->
RDMA_CM_ADDRINFO_RESOLVED ->
RDMA_CM_ADDR_QUERY ->
RDMA_CM_ADDR_RESOLVED ->
RDMA_CM_ROUTE_QUERY
In the 2nd case, resolve_addrinfo returns multiple records, and a user
can call rdma_resolve_addr() with one that is not the first.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/b6e82ad75522a13b5efe4ff86da0e465aab04cc2.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
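The two new flows listed in this commit can be captured as a small transition check. This is a sketch with hypothetical names (the SK_ prefix and sk_cm_transition_ok() are not kernel identifiers), and it models only the transitions spelled out in the message, not the pre-existing CM state machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the states named in the commit message. */
enum sk_cm_state {
	SK_ADDR_BOUND,
	SK_ADDRINFO_QUERY,
	SK_ADDRINFO_RESOLVED,
	SK_ADDR_QUERY,
	SK_ADDR_RESOLVED,
	SK_ROUTE_QUERY,
};

/* Returns true if from -> to appears in one of the two new flows. */
bool sk_cm_transition_ok(enum sk_cm_state from, enum sk_cm_state to)
{
	switch (from) {
	case SK_ADDR_BOUND:
		return to == SK_ADDRINFO_QUERY;
	case SK_ADDRINFO_QUERY:
		return to == SK_ADDRINFO_RESOLVED;
	case SK_ADDRINFO_RESOLVED:
		/* Flow 1 keeps the default address, flow 2 picks another one. */
		return to == SK_ROUTE_QUERY || to == SK_ADDR_QUERY;
	case SK_ADDR_QUERY:
		return to == SK_ADDR_RESOLVED;
	case SK_ADDR_RESOLVED:
		return to == SK_ROUTE_QUERY;
	default:
		return false;
	}
}
```

The only branching point is RDMA_CM_ADDRINFO_RESOLVED: going straight to route query uses the first returned record, while detouring through address resolution lets the caller choose a different one.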
|
|
Add an SA query API ib_sa_service_rec_get() to support building and
sending SA query MADs that ask for service records with a specific
name or ID, and receiving and parsing responses from the SM.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/9af6c82f3a3a9d975115a33235fb4ffc7c8edb21.1751279793.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Register GSI mad agent with RMPP support and add rmpp_callback for
SA queries. This is needed for querying more than one service record
in one query.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/81dbcb48682e1838dc40f381cdcc0dc63f25f0f1.1751279793.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
- Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1,
rxe, erdma, mana_ib
- Prefetch support for rxe ODP
- Remove memory window support from hns as new device FW no longer
supports it
- Remove qib, it is very old and obsolete now, Cornelis wishes to
restructure the hfi1/qib shared layer
- Fix a race in destroying CQs where we can still end up with work
running because the work is canceled before the driver stops
triggering it
- Improve interaction with namespaces:
* Follow the devlink namespace for newly spawned RDMA devices
* Create ipoib net devices in the parent IB device's namespace
* Allow CAP_NET_RAW checks to pass in user namespaces
- A new flow control scheme for IB MADs to try and avoid queue
overflows in the network
- Fix 2G message sizes in bnxt_re
- Optimize mkey layout for mlx5 DMABUF
- New "DMA Handle" concept to allow controlling PCI TPH and steering
tags
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/siw: Change maintainer email address
RDMA/mana_ib: add support of multiple ports
RDMA/mlx5: Refactor optional counters steering code
RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
IB: Extend UVERBS_METHOD_REG_MR to get DMAH
RDMA/mlx5: Add DMAH object support
RDMA/core: Introduce a DMAH object and its alloc/free APIs
IB/core: Add UVERBS_METHOD_REG_MR on the MR object
net/mlx5: Add support for device steering tag
net/mlx5: Expose IFC bits for TPH
PCI/TPH: Expose pcie_tph_get_st_table_size()
RDMA/mlx5: Fix incorrect MKEY masking
RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
RDMA/mlx5: remove redundant check on err on return expression
RDMA/mana_ib: add additional port counters
RDMA/mana_ib: Fix DSCP value in modify QP
RDMA/efa: Add CQ with external memory support
RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers
RDMA/uverbs: Add a common way to create CQ with umem
RDMA/mlx5: Optimize DMABUF mkey page size
...
|
|
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers.
It will be used in mlx5 driver as part of the next patch from the
series.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce a new DMA handle (DMAH) object along with its corresponding
allocation and deallocation APIs.
This DMAH object encapsulates attributes intended for use in DMA
transactions.
While its initial purpose is to support TPH functionality, it is
designed to be extensible for future features such as DMA PCI multipath,
PCI UIO configurations, PCI traffic class selection, and more.
Further details:
----------------
We ensure that a caller requesting a DMA handle for a specific CPU ID is
permitted to be scheduled on it. This prevents a potential security issue
where a non-privileged user may trigger DMA operations toward a CPU that
it's not allowed to run on.
We manage reference counting for the DMAH object and its consumers
(e.g., memory regions) as will be detailed in subsequent patches in the
series.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/2cad097e849597e49d6b61e6865dba878257f371.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This new method enables us to use a single ioctl from user space which
supports the below variants of reg_mr [1].
The method will be extended in the next patches from the series with an
extra attribute to let us pass DMA handle to be used as part of the
registration.
[1] ibv_reg_mr(), ibv_reg_mr_iova(), ibv_reg_mr_iova2(),
ibv_reg_dmabuf_mr().
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/5a3822ceef084efe967c9752e89c58d8250337c7.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add ioctl command attributes and a common handling for the option to
create CQs with memory buffers passed from userspace. When required
attributes are supplied, create umem and provide it for driver's use.
The extension enables creation of CQs on top of preallocated CPU
virtual or device memory buffers, by supplying VA or dmabuf fd, in a
common way.
Drivers can support this flow by initializing a new create_cq_umem fp
field in their ops struct, with a function that can handle the new
parameter.
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250708202308.24783-2-mrgolin@amazon.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Most responses (e.g., RTU) are not subject to flow control, as there is
no further response expected. However, REPs are both requests (waiting
for RTUs) and responses (being waited by REQs).
With agent-level flow control added to the MAD layer, REPs can get
delayed by outstanding REQs. This can cause a problem in a scenario
such as 2 hosts connecting to each other at the same time. Both hosts
fill the flow control outstanding slots with REQs. The corresponding
REPs are now blocked behind those REQs, and neither side can make
progress until REQs time out.
Add a separate MAD agent which is only used to send REPs. This agent
does not have a recv_handler as it doesn't process responses nor does it
register to receive requests. Disable flow control for agents w/o a
recv_handler, as they aren't waiting for responses. This allows the
newly added REP agent to send even when clients are slow to generate
RTU, which would be needed to unblock flow control outstanding slots.
Relax check in ib_post_send_mad to allow retries for this agent. REPs
will be retried by the MAD layer until CM layer receives a response
(e.g., RTU) on the normal agent and cancels them.
Suggested-by: Sean Hefty <shefty@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, MADs sent via an agent are being forwarded directly to the
corresponding MAD QP layer.
MADs with a timeout value set and requiring a response (solicited MADs)
will be resent if the timeout expires without receiving a response.
In a congested subnet, flooding MAD QP layer with more solicited send
requests from the agent will only worsen the situation by triggering
more timeouts and therefore more retries.
Thus, add flow control for non-user solicited MADs to block agents from
issuing new solicited MAD requests to the MAD QP until outstanding
requests are completed and the MAD QP is ready to process additional
requests. While at it, keep track of the total outstanding solicited
MAD work requests in send or wait list. The number of outstanding send
WRs will be limited by a fraction of the RQ size, and any new send WR
that exceeds that limit will be held in a backlog list.
Backlog MADs will be forwarded to agent send list only once the total
number of outstanding send WRs falls below the limit.
Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are
not subject to this flow control mechanism and will not be affected by
this change.
For this purpose, a new state is introduced:
- 'IB_MAD_STATE_QUEUED': MAD is in backlog list
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/c0ecaa1821badee124cd13f3bf860f67ce453beb.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
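The backlog mechanism described in this commit can be modeled with plain counters. This is a sketch under stated assumptions: the struct and function names are hypothetical, the divisor used to derive the limit from the RQ size is a free parameter (the message only says "a fraction"), and real MADs would sit on lists rather than be counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-agent flow control bookkeeping. */
struct mad_fc {
	unsigned int max_outstanding; /* limit derived from the RQ size */
	unsigned int outstanding;     /* solicited MADs in send or wait list */
	unsigned int backlog;         /* MADs held back (the QUEUED state) */
};

void fc_init(struct mad_fc *fc, unsigned int rq_size, unsigned int divisor)
{
	assert(divisor > 0);
	fc->max_outstanding = rq_size / divisor;
	fc->outstanding = 0;
	fc->backlog = 0;
}

/* Returns true if the MAD is sent now, false if it goes to the backlog. */
bool fc_post(struct mad_fc *fc)
{
	if (fc->outstanding < fc->max_outstanding) {
		fc->outstanding++;
		return true;
	}
	fc->backlog++;
	return false;
}

/* One outstanding MAD finished; promote one backlog entry if there is room. */
void fc_complete(struct mad_fc *fc)
{
	assert(fc->outstanding > 0);
	fc->outstanding--;
	if (fc->backlog > 0 && fc->outstanding < fc->max_outstanding) {
		fc->backlog--;
		fc->outstanding++;
	}
}
```

With an RQ size of 8 and a divisor of 4, the third post is held in the backlog and only reaches the send list once one of the first two completes.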
|
|
Replace the use of refcount, timeout and status with a 'state'
field to track the status of MADs send work requests (WRs).
The state machine better represents the stages in the MAD lifecycle,
specifically indicating whether the MAD is waiting for a completion,
waiting for a response, was canceled or is done.
The existing refcount only takes two values:
1 : MAD is waiting either for completion or for response.
2 : MAD is waiting for both response and completion. Also when a
response was received before a completion notification.
The status field represents if the MAD was canceled at some point
in the flow.
The timeout is used to represent if a response was received.
The current state transitions are not clearly visible, and developers
need to infer the state from the refcount's, timeout's or status's
value, which is error-prone and difficult to follow.
Thus, replace with a state machine as the following:
- 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list
- 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for
completion notification in send list
- 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for
a response in wait list
- 'IB_MAD_STATE_EARLY_RESP': Response came early, before send
completion notification, MAD is in the send list
- 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list
- 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list
Adding the state machine also makes it possible to remove the double
call for ib_mad_complete_send_wr in case of an early response and the
use of a done list in case of a regular response.
While at it, define a helper to clear error MADs which will handle
freeing MADs that timed out or have been cancelled.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/48e6ae8689dc7bb8b4ba6e5ec562e1b018db88a8.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
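The lifecycle listed in this commit can be sketched as an explicit transition table. The names below are hypothetical mirrors of the IB_MAD_STATE_* values, and the allowed transitions are inferred from the state descriptions for a solicited send; the message does not enumerate them, so treat the table as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the states named in the commit message. */
enum sk_mad_state {
	SK_MAD_INIT,
	SK_MAD_SEND_START,
	SK_MAD_WAIT_RESP,
	SK_MAD_EARLY_RESP,
	SK_MAD_CANCELED,
	SK_MAD_DONE,
};

/* Inferred transitions for a solicited MAD; not taken from the kernel. */
bool sk_mad_transition_ok(enum sk_mad_state from, enum sk_mad_state to)
{
	switch (from) {
	case SK_MAD_INIT:
		return to == SK_MAD_SEND_START;
	case SK_MAD_SEND_START:
		/* Send completes, a response arrives first, or it is canceled. */
		return to == SK_MAD_WAIT_RESP || to == SK_MAD_EARLY_RESP ||
		       to == SK_MAD_CANCELED;
	case SK_MAD_WAIT_RESP:
		return to == SK_MAD_DONE || to == SK_MAD_CANCELED;
	case SK_MAD_EARLY_RESP:
	case SK_MAD_CANCELED:
		return to == SK_MAD_DONE;
	default:
		return false;
	}
}
```

Compared with decoding a refcount, a timeout and a status together, a single state field lets each handler assert the state it expects before moving the MAD between lists.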
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 1bd8e0a9d0fd ("RDMA/counter: Allow manual mode configuration support")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/68e2064e72e94558a576fdbbb987681a64f6fea8.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to modify
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 0cadb4db79e1 ("RDMA/uverbs: Restrict usage of privileged QKEYs")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/099eb263622ccdd27014db7e02fec824a3307829.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/3914ef9702b01de8843a391ce397fca67d0fc7af.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 6d1e7ba241e9 ("IB/uverbs: Introduce create/destroy QP commands over ioctl")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/7b6b87505ccc28a1f7b4255af94d898d2df0fff5.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 2dee0e545894 ("IB/uverbs: Enable QP creation with a given source QP number")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/0e5920d1dfe836817bb07576b192da41b637130b.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non-default user namespace, this check fails. As a result,
a process running under Podman fails to create
the flow resource.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Link: https://patch.msgid.link/6df6f2f24627874c4f6d041c19dc1f6f29f68f84.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Presently, RDMA devices are always registered within the init network
namespace, even if the associated devlink device's namespace was
changed via a devlink reload. This mismatch leads to discrepancies
between the network namespace of the devlink device and that of the
RDMA device.
Therefore, extend the RDMA device allocation API to optionally take
the net namespace. This isn't limited to devices that support devlink
but allows all users to provide the network namespace if they need to
do so.
If a network namespace is provided during device allocation, it's up
to the caller to make sure the namespace stays valid until
ib_register_device() is called.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
In the s390 defconfig, gcc-10 and earlier end up inlining three functions
into nldev_stat_get_doit(), and each of them uses some 600 bytes of stack.
The result is a function with an overly large stack frame and a warning:
drivers/infiniband/core/nldev.c:2466:1: error: the frame size of 1720 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
Mark the three functions noinline_for_stack to prevent this, ensuring
that only one copy of the nlattr array is on the stack of each function.
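
A minimal userspace sketch of the technique, assuming GCC or Clang. The macro
below mirrors the kernel's noinline_for_stack, which expands to noinline; the
helper names, the table length, and the dispatcher are hypothetical and only
illustrate how each large array stays in its own helper's frame.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel macro (GCC/Clang attribute syntax). */
#define noinline_for_stack __attribute__((noinline))

#define TABLE_LEN 150  /* hypothetical size: about 600 bytes of int */

/* Fill the caller-provided table and return the sum of its entries. */
static int fill_and_sum(int *tb, int base)
{
    int sum = 0;

    for (size_t i = 0; i < TABLE_LEN; i++) {
        tb[i] = base + (int)i;
        sum += tb[i];
    }
    return sum;
}

/* Each helper owns one large table; noinline keeps it out of the caller. */
static noinline_for_stack int stat_get_first(void)
{
    int tb[TABLE_LEN];

    return fill_and_sum(tb, 0);
}

static noinline_for_stack int stat_get_second(void)
{
    int tb[TABLE_LEN];

    return fill_and_sum(tb, 1);
}

static noinline_for_stack int stat_get_third(void)
{
    int tb[TABLE_LEN];

    return fill_and_sum(tb, 2);
}

/* The dispatcher's own frame stays small: it never holds a table itself. */
static int stat_get_doit(int kind)
{
    switch (kind) {
    case 0: return stat_get_first();
    case 1: return stat_get_second();
    case 2: return stat_get_third();
    default: return -1;
    }
}
```

Without the attribute, a compiler is free to inline all three helpers into the
dispatcher, and it may then reserve stack space for all three tables at once.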
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620113335.3776965-1-arnd@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, ib_free_cq disables the IRQ or cancels the CQ work before
calling the driver's destroy_cq. This is unsafe, as a new IRQ or CQ work
can be submitted immediately after the IRQ is disabled or the CQ work is
canceled, and may then run concurrently with destroy_cq and cause crashes.
The right flow should be:
1. The driver disables the CQ to make sure no new CQ event will be submitted;
2. The core layer disables the IRQ or cancels the CQ work, to make sure no
CQ polling work is running;
3. All resources are freed to destroy the CQ.
This patch adds 2 driver APIs:
- pre_destroy_cq(): Disable a CQ to prevent it from generating any new
work completions, without freeing any kernel resources;
- post_destroy_cq(): Free all kernel resources.
In ib_free_cq, the IRQ is disabled or CQ work is canceled after
pre_destroy_cq, and before post_destroy_cq.
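
The three-step flow above can be modelled in a few lines of userspace C. This
is a sketch under stated assumptions, not the kernel implementation: struct cq,
the step trace, and every function name are hypothetical, and "hardware" is
reduced to a flag that decides whether a new completion event is accepted.

```c
#include <stdbool.h>

enum cq_step {
    STEP_PRE_DESTROY = 1,   /* driver stops the CQ from raising events */
    STEP_STOP_POLLING,      /* core disables IRQ / cancels CQ work */
    STEP_POST_DESTROY,      /* driver frees resources */
};

/* Hypothetical model of a completion queue. */
struct cq {
    bool hw_enabled;        /* device may still raise completion events */
    bool poll_pending;      /* a polling work item is queued or running */
    bool freed;             /* resources released */
    enum cq_step trace[3];  /* order in which the steps ran */
    int ntrace;
};

static void record_step(struct cq *cq, enum cq_step step)
{
    if (cq->ntrace < 3)
        cq->trace[cq->ntrace++] = step;
}

/* A completion event arms polling only while the CQ is still enabled. */
static bool hw_raise_event(struct cq *cq)
{
    if (!cq->hw_enabled)
        return false;
    cq->poll_pending = true;
    return true;
}

static void drv_pre_destroy_cq(struct cq *cq)
{
    cq->hw_enabled = false;
    record_step(cq, STEP_PRE_DESTROY);
}

static void core_stop_polling(struct cq *cq)
{
    cq->poll_pending = false;
    record_step(cq, STEP_STOP_POLLING);
}

static void drv_post_destroy_cq(struct cq *cq)
{
    cq->freed = true;
    record_step(cq, STEP_POST_DESTROY);
}

/* Teardown in the order the commit message describes. */
static void free_cq(struct cq *cq)
{
    drv_pre_destroy_cq(cq);
    core_stop_polling(cq);
    drv_post_destroy_cq(cq);
}
```

Because the CQ is disabled first, no event can re-arm polling between the
second and third steps, which is the window the old ordering left open.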
Fixes: 14d3a3b2498e ("IB: add a proper completion queue abstraction")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/b5f7ae3d75f44a3e15ff3f4eb2bbdea13e06b97f.1750062328.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Following the fix in the previous commit ("IB/mlx5: Fix potential
deadlock in MR deregistration"), teach lockdep explicitly about the
locking order between fs_reclaim and umem_mutex.
The previous commit resolved a potential deadlock scenario where
kzalloc(GFP_KERNEL) was called while holding umem_mutex, which could
lead to reclaim and eventually invoke the MMU notifier
(mlx5_ib_invalidate_range()), causing a recursive acquisition of
umem_mutex.
To prevent such issues from reoccurring unnoticed in future code
changes, add a lockdep annotation in ib_init_umem_odp() that simulates
taking umem_mutex inside a reclaim context. This makes lockdep aware
of this locking dependency and ensures that future violations—such as
calling kzalloc() or any memory allocator that may enter reclaim while
holding umem_mutex—will immediately raise a lockdep warning.
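
To illustrate why recording the dependency up front helps, here is a toy
dependency tracker in userspace C. It is far simpler than lockdep and shares
none of its code; the lock classes, the tracker, and all function names are
hypothetical. The idea it shows is the same: once the order "reclaim, then
mutex" has been recorded, the first attempt to enter reclaim while holding the
mutex is reported, even if no real deadlock occurs on that run.

```c
#include <stdbool.h>

/* Two lock classes are enough for this toy model. */
enum lock_class { CLASS_RECLAIM, CLASS_UMEM_MUTEX, CLASS_COUNT };

static bool held[CLASS_COUNT];
/* order_seen[a][b]: class b was acquired while class a was held. */
static bool order_seen[CLASS_COUNT][CLASS_COUNT];
static int warnings;

static void acquire(enum lock_class c)
{
    for (int h = 0; h < CLASS_COUNT; h++) {
        if (!held[h] || h == (int)c)
            continue;
        if (order_seen[c][h])
            warnings++;         /* the inverse order was recorded earlier */
        order_seen[h][c] = true;
    }
    held[c] = true;
}

static void release(enum lock_class c)
{
    held[c] = false;
}

/* The annotation: pretend to take the mutex from inside reclaim once. */
static void annotate_mutex_taken_in_reclaim(void)
{
    acquire(CLASS_RECLAIM);
    acquire(CLASS_UMEM_MUTEX);
    release(CLASS_UMEM_MUTEX);
    release(CLASS_RECLAIM);
}

/* Any allocation that may enter reclaim. */
static void alloc_may_reclaim(void)
{
    acquire(CLASS_RECLAIM);
    release(CLASS_RECLAIM);
}

/* The bug pattern: allocating while holding the mutex. */
static void alloc_under_mutex(void)
{
    acquire(CLASS_UMEM_MUTEX);
    alloc_may_reclaim();
    release(CLASS_UMEM_MUTEX);
}
```

Calling annotate_mutex_taken_in_reclaim() once at initialization is what makes
a later alloc_under_mutex() trip the warning counter deterministically.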
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9d31b9d8fe1db648a9f47cec3df6b8463319dee5.1750061698.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The GID cache warning messages can flood the kernel log when there are
multiple failed attempts to add GIDs. This can happen when creating many
virtual interfaces without having enough space for their GIDs in the GID
table.
Change pr_warn to pr_warn_ratelimited to prevent log flooding while still
maintaining visibility of the issue.
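
The effect of rate limiting can be sketched with a small window-based limiter
in userspace C. The struct, the function names, and the explicit "now"
parameter are hypothetical and exist only to keep the example deterministic;
the kernel's own ratelimit code is not reproduced here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limiter: at most 'burst' messages per 'interval' time units. */
struct ratelimit {
    long interval;
    int burst;
    long window_start;
    int printed;        /* messages emitted in the current window */
    int missed;         /* total messages suppressed so far */
};

static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    /* Start a fresh window once the interval has elapsed. */
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}

/* Emit the warning only when the limiter allows it; report whether it did. */
static bool warn_ratelimited(struct ratelimit *rl, long now, const char *msg)
{
    if (!ratelimit_allow(rl, now))
        return false;
    fprintf(stderr, "warning: %s\n", msg);
    return true;
}
```

With a burst of 10, a storm of 25 failures inside one window prints 10 lines
and suppresses 15, and messages are printed again once the next window begins.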
Link: https://patch.msgid.link/r/fd45ed4a1078e743f498b234c3ae816610ba1b18.1750062357.git.leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The cited commit fixed a crash when cma_netevent_callback was called for
a cma_id while work on that id from a previous call had not yet started.
The work item was re-initialized in the second call, which corrupted the
work item currently in the work queue.
However, it left a problem when queue_work fails (because the item is
still pending in the work queue from a previous call). In this case,
the work handler does not run for the second call, so the cma_id_put
it would have performed is never called. This results in a userspace
process hang (zombie process).
Fix this by calling cma_id_put() if queue_work fails.
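
The reference imbalance and its fix can be shown with a userspace model. The
structs and function names below are hypothetical; the only behaviour borrowed
from the kernel is that queueing a work item that is already pending fails and
returns false.

```c
#include <stdbool.h>

struct work { bool pending; };

/* Hypothetical model of an ID with a reference count and one work item. */
struct cma_id_model {
    int refcount;
    struct work net_work;
};

/* Returns false if the work item was already pending. */
static bool queue_work_model(struct work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

static void id_get(struct cma_id_model *id) { id->refcount++; }
static void id_put(struct cma_id_model *id) { id->refcount--; }

/* The work handler drops the reference taken when the work was queued. */
static void run_pending_work(struct cma_id_model *id)
{
    if (!id->net_work.pending)
        return;
    id->net_work.pending = false;
    id_put(id);
}

/* Event callback with the fix: undo the get when queueing fails. */
static void netevent_callback(struct cma_id_model *id)
{
    id_get(id);
    if (!queue_work_model(&id->net_work))
        id_put(id);
}
```

Two back-to-back callbacks leave exactly one extra reference, which the single
handler run then drops; without the put on failure the count would never
return to its starting value.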
Fixes: 45f5dcdd0497 ("RDMA/cma: Fix workqueue crash in cma_netevent_work_handler")
Link: https://patch.msgid.link/r/4f3640b501e48d0166f312a64fdadf72b059bd04.1747827103.git.leon@kernel.org
Signed-off-by: Jack Morgenstein <jackm@nvidia.com>
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The following patches need the RDMA rc branch since we are past the RC
cycle now.
Merge conflicts resolved based on Linux-next:
- For RXE odp changes keep for-next version and fixup new places that
need to call is_odp_mr()
https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au
https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au
- irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info
change from for-next
https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Drop ib_send_cm_mra parameters which are always constant. Remove branch
which is never taken. Adjust name to ib_prepare_cm_mra, which better
reflects its functionality - no MRA is actually sent. Adjust name of
related tracepoints. Push setting of the constant service timeout to
cm.c and drop IB_CM_MRA_FLAG_DELAY.
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Link: https://patch.msgid.link/cdd2a237acf2b495c19ce02e4b1c42c41c6751c2.1747827207.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Drivers such as rxe, which use virtual DMA, must not call into the DMA
mapping core since they lack physical DMA capabilities. Otherwise, a NULL
pointer dereference is observed as shown below. This patch ensures the RDMA
core handles virtual and physical DMA paths appropriately.
This fixes the following kernel oops:
BUG: kernel NULL pointer dereference, address: 00000000000002fc
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1028eb067 P4D 1028eb067 PUD 105da0067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 3 UID: 1000 PID: 1854 Comm: python3 Tainted: G W 6.15.0-rc1+ #11 PREEMPT(voluntary)
Tainted: [W]=WARN
Hardware name: Trigkey Key N/Key N, BIOS KEYN101 09/02/2024
RIP: 0010:hmm_dma_map_alloc+0x25/0x100
Code: 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 d6 49 c1 e6 0c 41 55 41 54 53 49 39 ce 0f 82 c6 00 00 00 49 89 fc <f6> 87 fc 02 00 00 20 0f 84 af 00 00 00 49 89 f5 48 89 d3 49 89 cf
RSP: 0018:ffffd3d3420eb830 EFLAGS: 00010246
RAX: 0000000000001000 RBX: ffff8b727c7f7400 RCX: 0000000000001000
RDX: 0000000000000001 RSI: ffff8b727c7f74b0 RDI: 0000000000000000
RBP: ffffd3d3420eb858 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 00007262a622a000 R14: 0000000000001000 R15: ffff8b727c7f74b0
FS: 00007262a62a1080(0000) GS:ffff8b762ac3e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002fc CR3: 000000010a1f0004 CR4: 0000000000f72ef0
PKRU: 55555554
Call Trace:
<TASK>
ib_init_umem_odp+0xb6/0x110 [ib_uverbs]
ib_umem_odp_get+0xf0/0x150 [ib_uverbs]
rxe_odp_mr_init_user+0x71/0x170 [rdma_rxe]
rxe_reg_user_mr+0x217/0x2e0 [rdma_rxe]
ib_uverbs_reg_mr+0x19e/0x2e0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd9/0x150 [ib_uverbs]
ib_uverbs_cmd_verbs+0xd19/0xee0 [ib_uverbs]
? mmap_region+0x63/0xd0
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
ib_uverbs_ioctl+0xba/0x130 [ib_uverbs]
__x64_sys_ioctl+0xa4/0xe0
x64_sys_call+0x1178/0x2660
do_syscall_64+0x7e/0x170
? syscall_exit_to_user_mode+0x4e/0x250
? do_syscall_64+0x8a/0x170
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x4e/0x250
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x4e/0x250
? do_syscall_64+0x8a/0x170
? do_user_addr_fault+0x1d2/0x8d0
? irqentry_exit_to_user_mode+0x43/0x250
? irqentry_exit+0x43/0x50
? exc_page_fault+0x93/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7262a6124ded
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007fffd08c3960 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fffd08c39f0 RCX: 00007262a6124ded
RDX: 00007fffd08c3a10 RSI: 00000000c0181b01 RDI: 0000000000000007
RBP: 00007fffd08c39b0 R08: 0000000014107820 R09: 00007fffd08c3b44
R10: 000000000000000c R11: 0000000000000246 R12: 00007fffd08c3b44
R13: 000000000000000c R14: 00007fffd08c3b58 R15: 0000000014107960
</TASK>
Fixes: 1efe8c0670d6 ("RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage")
Closes: https://lore.kernel.org/all/3e8f343f-7d66-4f7a-9f08-3910623e322f@gmail.com/
Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
Link: https://patch.msgid.link/20250524144328.4361-1-dskmtsd@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The commit 59c68ac31e15 ("iw_cm: free cm_id resources on the last
deref") simplified cm_id resource management by freeing cm_id once all
references to the cm_id were removed. The references are removed either
upon completion of iw_cm event handlers or when the application destroys
the cm_id. That commit introduced a use-after-free condition in which the
cm_id_private object could still be in use by event handler works during
the destruction of the cm_id. The commit aee2424246f9 ("RDMA/iwcm: Fix a
use-after-free related to destroying CM IDs") addressed this
use-after-free by flushing all pending works at cm_id destruction.
However, another use-after-free possibility remained. It involves the
work objects allocated for each cm_id_priv within alloc_work_entries()
during cm_id creation and subsequently freed in dealloc_work_entries()
once all references to the cm_id are removed. If the cm_id's last
reference is dropped in the event handler work, the work object for that
work itself is freed, which causes the use-after-free BUG below:
BUG: KASAN: slab-use-after-free in __pwq_activate_work+0x1ff/0x250
Read of size 8 at addr ffff88811f9cf800 by task kworker/u16:1/147091
CPU: 2 UID: 0 PID: 147091 Comm: kworker/u16:1 Not tainted 6.15.0-rc2+ #27 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Workqueue: 0x0 (iw_cm_wq)
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
print_report+0x174/0x554
? __virt_addr_valid+0x208/0x430
? __pwq_activate_work+0x1ff/0x250
kasan_report+0xae/0x170
? __pwq_activate_work+0x1ff/0x250
__pwq_activate_work+0x1ff/0x250
pwq_dec_nr_in_flight+0x8c5/0xfb0
process_one_work+0xc11/0x1460
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5ef/0xfd0
? __pfx_worker_thread+0x10/0x10
kthread+0x3b0/0x770
? __pfx_kthread+0x10/0x10
? rcu_is_watching+0x11/0xb0
? _raw_spin_unlock_irq+0x24/0x50
? rcu_is_watching+0x11/0xb0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 147416:
kasan_save_stack+0x2c/0x50
kasan_save_track+0x10/0x30
__kasan_kmalloc+0xa6/0xb0
alloc_work_entries+0xa9/0x260 [iw_cm]
iw_cm_connect+0x23/0x4a0 [iw_cm]
rdma_connect_locked+0xbfd/0x1920 [rdma_cm]
nvme_rdma_cm_handler+0x8e5/0x1b60 [nvme_rdma]
cma_cm_event_handler+0xae/0x320 [rdma_cm]
cma_work_handler+0x106/0x1b0 [rdma_cm]
process_one_work+0x84f/0x1460
worker_thread+0x5ef/0xfd0
kthread+0x3b0/0x770
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
Freed by task 147091:
kasan_save_stack+0x2c/0x50
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x60
__kasan_slab_free+0x4b/0x70
kfree+0x13a/0x4b0
dealloc_work_entries+0x125/0x1f0 [iw_cm]
iwcm_deref_id+0x6f/0xa0 [iw_cm]
cm_work_handler+0x136/0x1ba0 [iw_cm]
process_one_work+0x84f/0x1460
worker_thread+0x5ef/0xfd0
kthread+0x3b0/0x770
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
Last potentially related work creation:
kasan_save_stack+0x2c/0x50
kasan_record_aux_stack+0xa3/0xb0
__queue_work+0x2ff/0x1390
queue_work_on+0x67/0xc0
cm_event_handler+0x46a/0x820 [iw_cm]
siw_cm_upcall+0x330/0x650 [siw]
siw_cm_work_handler+0x6b9/0x2b20 [siw]
process_one_work+0x84f/0x1460
worker_thread+0x5ef/0xfd0
kthread+0x3b0/0x770
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
This BUG is reproducible by repeating the blktests test case nvme/061
for the rdma transport and the siw driver.
To avoid the use-after-free of cm_id_private work objects, ensure that
the last reference to the cm_id is decremented not in the event handler
works, but in the cm_id destruction context. For that purpose, move
iwcm_deref_id() call from destroy_cm_id() to the callers of
destroy_cm_id(). In iw_destroy_cm_id(), call iwcm_deref_id() after
flushing the pending works.
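
A userspace model of the two orderings, as a rough sketch. All names are
hypothetical, and the "old" ordering is a simplified reading of the
description above (last reference dropped before pending works are flushed),
not a copy of the previous code. The "uaf" flag stands for the workqueue
touching a work object after the handler has freed it.

```c
#include <stdbool.h>

/* Hypothetical model of a cm_id and its preallocated work objects. */
struct cm_id_model {
    int refcount;
    int pending_works;
    bool work_entries_freed;
    bool uaf;   /* a work object was freed while its work was running */
};

/* Dropping the last reference frees the work objects. */
static void deref_id(struct cm_id_model *id)
{
    if (--id->refcount == 0)
        id->work_entries_freed = true;
}

/* Each queued event holds one reference until its handler finishes. */
static void queue_event(struct cm_id_model *id)
{
    id->refcount++;
    id->pending_works++;
}

static void run_one_work(struct cm_id_model *id)
{
    id->pending_works--;
    deref_id(id);
    /* The workqueue still uses the work object after the handler returns. */
    if (id->work_entries_freed)
        id->uaf = true;
}

static void flush_works(struct cm_id_model *id)
{
    while (id->pending_works > 0)
        run_one_work(id);
}

/* Old ordering in this model: creation reference dropped before the flush. */
static void destroy_id_old(struct cm_id_model *id)
{
    deref_id(id);
    flush_works(id);
}

/* Fixed ordering: flush first, so the last deref happens in destroy context. */
static void destroy_id_fixed(struct cm_id_model *id)
{
    flush_works(id);
    deref_id(id);
}
```

In the fixed ordering the handler can never see the reference count reach
zero, so the work objects are freed only after every work has completed.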
During the fix work, I noticed that iw_destroy_cm_id() is called from
cm_work_handler() and process_event() context. However, the comment of
iw_destroy_cm_id() notes that the function "cannot be called by the
event thread". Drop the false comment.
Closes: https://lore.kernel.org/linux-rdma/r5676e754sv35aq7cdsqrlnvyhiq5zktteaurl7vmfih35efko@z6lay7uypy3c/
Fixes: 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref")
Cc: stable@vger.kernel.org
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://patch.msgid.link/20250510101036.1756439-1-shinichiro.kawasaki@wdc.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Create separate functions for the implicit ODP initialization,
which differs from the explicit ODP initialization.
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Reuse the newly added DMA API to cache the IOVA and only link/unlink pages
in the fast path for the UMEM ODP flow.
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|