linux-toradex.git - Linux kernel for Apalis and Colibri modules

Age	Commit message (Collapse)	Author
7 days	IB/rxe: Fix missing umem_odp->umem_mutex unlock on error path	Li Zhijian
	rxe_odp_map_range_and_lock() must release umem_odp->umem_mutex when an error occurs, including cases where rxe_check_pagefault() fails. Fixes: 2fae67ab63db ("RDMA/rxe: Add support for Send/Recv/Write/Read with ODP") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Link: https://patch.msgid.link/20251226094112.3042583-1-lizhijian@fujitsu.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-21	RDMA/rxe: let rxe_reclassify_recv_socket() call sk_owner_put()	Stefan Metzmacher
	On kernels build with CONFIG_PROVE_LOCKING, CONFIG_MODULES and CONFIG_DEBUG_LOCK_ALLOC 'rmmod rdma_rxe' is no longer possible. For the global recv sockets rxe_net_exit() is where we call rxe_release_udp_tunnel-> udp_tunnel_sock_release(), which means the sockets are destroyed before 'rmmod rdma_rxe' finishes, so there's no need to protect against rxe_recv_slock_key and rxe_recv_sk_key disappearing while the sockets are still alive. Fixes: 80a85a771deb ("RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep") Cc: Zhu Yanjun <zyjzyj2000@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20251219140408.2300163-1-metze@samba.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-12-04	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "This has another new RDMA driver 'bng_en' for latest generation Broadcom NICs. There might be one more new driver still to come. Otherwise it is a fairly quite cycle. Summary: - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re, mlx5 - Many bug fix patches for irdma - WQ_PERCPU annotations and system_dfl_wq changes - Improved mlx5 support for "other eswitches" and multiple PFs - 1600Gbps link speed reporting support. Four Digits Now! - New driver bng_en for latest generation Broadcom NICs - Bonding support for hns - Adjust mlx5's hmm based ODP to work with the very large address space created by the new 5 level paging default on x86 - Lockdep fixups in rxe and siw" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits) RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep RDMA/siw: reclassify sockets in order to avoid false positives from lockdep RDMA/bng_re: Remove prefetch instruction RDMA/core: Reduce cond_resched() frequency in __ib_umem_release RDMA/irdma: Fix SRQ shadow area address initialization RDMA/irdma: Remove doorbell elision logic RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+ RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY RDMA/irdma: Add missing mutex destroy RDMA/irdma: Fix SIGBUS in AEQ destroy RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2 RDMA/irdma: Fix data race in irdma_free_pble RDMA/irdma: Fix data race in irdma_sc_ccq_arm RDMA/mlx5: Add support for 1600_8x lane speed RDMA/core: Add new IB rate for XDR (8x) support IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled RDMA/bnxt_re: Pass correct flag for dma mr creation RDMA/bnxt_re: Fix the inline size for GenP7 devices RDMA/hns: Support reset recovery for bond RDMA/hns: Support link state reporting for bond ...
2025-11-27	RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep	Stefan Metzmacher
	While developing IPPROTO_SMBDIRECT support for the code under fs/smb/common/smbdirect [1], I noticed false positives like this: [+0,003927] ============================================ [+0,000532] WARNING: possible recursive locking detected [+0,000611] 6.18.0-rc5-metze-kasan-lockdep.02+ #1 Tainted: G OE [+0,000835] -------------------------------------------- [+0,000729] ksmbd:r5445/3609 is trying to acquire lock: [+0,000709] ffff88800b9570f8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: inet_shutdown+0x52/0x360 [+0,000831] but task is already holding lock: [+0,000684] ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: smbdirect_sk_close+0x122/0x790 [smbdirect] [+0,000928] other info that might help us debug this: [+0,005552] Possible unsafe locking scenario: [+0,000723] CPU0 [+0,000359] ---- [+0,000377] lock(k-sk_lock-AF_INET); [+0,000478] lock(k-sk_lock-AF_INET); [+0,000498] * DEADLOCK * [+0,001012] May be due to missing lock nesting notation [+0,000831] 3 locks held by ksmbd:r5445/3609: [+0,000484] #0: ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: smbdirect_sk_close+0x122/0x790 [smbdirect] [+0,001000] #1: ffff888020a40458 (&id_priv->handler_mutex){+.+.}-{4:4}, at: rdma_lock_handler+0x17/0x30 [rdma_cm] [+0,000982] #2: ffff888020a40350 (&id_priv->qp_mutex){+.+.}-{4:4}, at: rdma_destroy_qp+0x5d/0x1f0 [rdma_cm] [+0,000934] stack backtrace: [+0,000589] CPU: 0 UID: 0 PID: 3609 Comm: ksmbd:r5445 Kdump: loaded Tainted: G OE 6.18.0-rc5-metze-kasan-lockdep.02+ #1 PREEMPT(voluntary) [+0,000023] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [+0,000004] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 ... [+0,000010] print_deadlock_bug+0x245/0x330 [+0,000014] validate_chain+0x32a/0x590 [+0,000012] __lock_acquire+0x535/0xc30 [+0,000013] lock_acquire.part.0+0xb3/0x240 [+0,000017] ? inet_shutdown+0x52/0x360 [+0,000013] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000007] ? mark_held_locks+0x46/0x90 [+0,000012] lock_acquire+0x60/0x140 [+0,000006] ? inet_shutdown+0x52/0x360 [+0,000028] lock_sock_nested+0x3b/0xf0 [+0,000009] ? inet_shutdown+0x52/0x360 [+0,000008] inet_shutdown+0x52/0x360 [+0,000010] kernel_sock_shutdown+0x5b/0x90 [+0,000011] rxe_qp_do_cleanup+0x4ef/0x810 [rdma_rxe] [+0,000043] ? __pfx_rxe_qp_do_cleanup+0x10/0x10 [rdma_rxe] [+0,000030] execute_in_process_context+0x2b/0x170 [+0,000013] rxe_qp_cleanup+0x1c/0x30 [rdma_rxe] [+0,000021] __rxe_cleanup+0x1cf/0x2e0 [rdma_rxe] [+0,000036] ? __pfx___rxe_cleanup+0x10/0x10 [rdma_rxe] [+0,000020] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000006] ? __kasan_check_read+0x11/0x20 [+0,000012] rxe_destroy_qp+0xe1/0x230 [rdma_rxe] [+0,000035] ib_destroy_qp_user+0x217/0x450 [ib_core] [+0,000074] rdma_destroy_qp+0x83/0x1f0 [rdma_cm] [+0,000034] smbdirect_connection_destroy_qp+0x98/0x2e0 [smbdirect] [+0,000017] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd] [+0,000044] smbdirect_connection_destroy+0x698/0xed0 [smbdirect] [+0,000023] ? __pfx_smbdirect_connection_destroy+0x10/0x10 [smbdirect] [+0,000033] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd] [+0,000031] smbdirect_connection_destroy_sync+0x42b/0x9f0 [smbdirect] [+0,000029] ? mark_held_locks+0x46/0x90 [+0,000012] ? __pfx_smbdirect_connection_destroy_sync+0x10/0x10 [smbdirect] [+0,000019] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000007] ? trace_hardirqs_on+0x64/0x70 [+0,000029] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000010] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000006] ? __smbdirect_connection_schedule_disconnect+0x339/0x4b0 [+0,000021] smbdirect_sk_destroy+0xb0/0x680 [smbdirect] [+0,000024] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000006] ? trace_hardirqs_on+0x64/0x70 [+0,000006] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000005] ? __local_bh_enable_ip+0xba/0x150 [+0,000011] sk_common_release+0x66/0x340 [+0,000010] smbdirect_sk_close+0x12a/0x790 [smbdirect] [+0,000023] ? ip_mc_drop_socket+0x1e/0x240 [+0,000013] inet_release+0x10a/0x240 [+0,000011] smbdirect_sock_release+0x502/0xe80 [smbdirect] [+0,000015] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000024] sock_release+0x91/0x1c0 [+0,000010] smb_direct_free_transport+0x31/0x50 [ksmbd] [+0,000025] ksmbd_conn_free+0x1d0/0x240 [ksmbd] [+0,000040] smb_direct_disconnect+0xb2/0x120 [ksmbd] [+0,000023] ? srso_alias_return_thunk+0x5/0xfbef5 [+0,000018] ksmbd_conn_handler_loop+0x94e/0xf10 [ksmbd] ... I'll also add reclassify to the smbdirect socket code [1], but I think it's better to have it in both direction (below and above the RDMA layer). [1] https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect Cc: Zhu Yanjun <zyjzyj2000@gmail.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20251127105614.2040922-1-metze@samba.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-27	RDMA/siw: reclassify sockets in order to avoid false positives from lockdep	Stefan Metzmacher
	While developing IPPROTO_SMBDIRECT support for the code under fs/smb/common/smbdirect [1], I noticed false positives like this: [T79] ====================================================== [T79] WARNING: possible circular locking dependency detected [T79] 6.18.0-rc4-metze-kasan-lockdep.01+ #1 Tainted: G OE [T79] ------------------------------------------------------ [T79] kworker/2:0/79 is trying to acquire lock: [T79] ffff88801f968278 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_set_reuseaddr+0x14/0x70 [T79] but task is already holding lock: [T79] ffffffffc10f7230 (lock#9){+.+.}-{4:4}, at: rdma_listen+0x3d2/0x740 [rdma_cm] [T79] which lock already depends on the new lock. [T79] the existing dependency chain (in reverse order) is: [T79] -> #1 (lock#9){+.+.}-{4:4}: [T79] __lock_acquire+0x535/0xc30 [T79] lock_acquire.part.0+0xb3/0x240 [T79] lock_acquire+0x60/0x140 [T79] __mutex_lock+0x1af/0x1c10 [T79] mutex_lock_nested+0x1b/0x30 [T79] cma_get_port+0xba/0x7d0 [rdma_cm] [T79] rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm] [T79] cma_bind_addr+0x107/0x320 [rdma_cm] [T79] rdma_resolve_addr+0xa3/0x830 [rdma_cm] [T79] destroy_lease_table+0x12b/0x420 [ksmbd] [T79] ksmbd_NTtimeToUnix+0x3e/0x80 [ksmbd] [T79] ndr_encode_posix_acl+0x6e9/0xab0 [ksmbd] [T79] ndr_encode_v4_ntacl+0x53/0x870 [ksmbd] [T79] __sys_connect_file+0x131/0x1c0 [T79] __sys_connect+0x111/0x140 [T79] __x64_sys_connect+0x72/0xc0 [T79] x64_sys_call+0xe7d/0x26a0 [T79] do_syscall_64+0x93/0xff0 [T79] entry_SYSCALL_64_after_hwframe+0x76/0x7e [T79] -> #0 (sk_lock-AF_INET){+.+.}-{0:0}: [T79] check_prev_add+0xf3/0xcd0 [T79] validate_chain+0x466/0x590 [T79] __lock_acquire+0x535/0xc30 [T79] lock_acquire.part.0+0xb3/0x240 [T79] lock_acquire+0x60/0x140 [T79] lock_sock_nested+0x3b/0xf0 [T79] sock_set_reuseaddr+0x14/0x70 [T79] siw_create_listen+0x145/0x1540 [siw] [T79] iw_cm_listen+0x313/0x5b0 [iw_cm] [T79] cma_iw_listen+0x271/0x3c0 [rdma_cm] [T79] rdma_listen+0x3b1/0x740 [rdma_cm] [T79] cma_listen_on_dev+0x46a/0x750 [rdma_cm] [T79] rdma_listen+0x4b0/0x740 [rdma_cm] [T79] ksmbd_rdma_init+0x12b/0x270 [ksmbd] [T79] ksmbd_conn_transport_init+0x26/0x70 [ksmbd] [T79] server_ctrl_handle_work+0x1e5/0x280 [ksmbd] [T79] process_one_work+0x86c/0x1930 [T79] worker_thread+0x6f0/0x11f0 [T79] kthread+0x3ec/0x8b0 [T79] ret_from_fork+0x314/0x400 [T79] ret_from_fork_asm+0x1a/0x30 [T79] other info that might help us debug this: [T79] Possible unsafe locking scenario: [T79] CPU0 CPU1 [T79] ---- ---- [T79] lock(lock#9); [T79] lock(sk_lock-AF_INET); [T79] lock(lock#9); [T79] lock(sk_lock-AF_INET); [T79] * DEADLOCK * [T79] 5 locks held by kworker/2:0/79: [T79] #0: ffff88800120b158 ((wq_completion)events_long){+.+.}-{0:0}, at: process_one_work+0xfca/0x1930 [T79] #1: ffffc9000474fd00 ((work_completion)(&ctrl->ctrl_work)) {+.+.}-{0:0}, at: process_one_work+0x804/0x1930 [T79] #2: ffffffffc11307d0 (ctrl_lock){+.+.}-{4:4}, at: server_ctrl_handle_work+0x21/0x280 [ksmbd] [T79] #3: ffffffffc11347b0 (init_lock){+.+.}-{4:4}, at: ksmbd_conn_transport_init+0x18/0x70 [ksmbd] [T79] #4: ffffffffc10f7230 (lock#9){+.+.}-{4:4}, at: rdma_listen+0x3d2/0x740 [rdma_cm] [T79] stack backtrace: [T79] CPU: 2 UID: 0 PID: 79 Comm: kworker/2:0 Kdump: loaded Tainted: G OE 6.18.0-rc4-metze-kasan-lockdep.01+ #1 PREEMPT(voluntary) [T79] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [T79] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [T79] Workqueue: events_long server_ctrl_handle_work [ksmbd] ... [T79] print_circular_bug+0xfd/0x130 [T79] check_noncircular+0x150/0x170 [T79] check_prev_add+0xf3/0xcd0 [T79] validate_chain+0x466/0x590 [T79] __lock_acquire+0x535/0xc30 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] lock_acquire.part.0+0xb3/0x240 [T79] ? sock_set_reuseaddr+0x14/0x70 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? __kasan_check_write+0x14/0x30 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? apparmor_socket_post_create+0x180/0x700 [T79] lock_acquire+0x60/0x140 [T79] ? sock_set_reuseaddr+0x14/0x70 [T79] lock_sock_nested+0x3b/0xf0 [T79] ? sock_set_reuseaddr+0x14/0x70 [T79] sock_set_reuseaddr+0x14/0x70 [T79] siw_create_listen+0x145/0x1540 [siw] [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? local_clock_noinstr+0xe/0xd0 [T79] ? __pfx_siw_create_listen+0x10/0x10 [siw] [T79] ? trace_preempt_on+0x4c/0x130 [T79] ? __raw_spin_unlock_irqrestore+0x4a/0x90 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? preempt_count_sub+0x52/0x80 [T79] iw_cm_listen+0x313/0x5b0 [iw_cm] [T79] cma_iw_listen+0x271/0x3c0 [rdma_cm] [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] rdma_listen+0x3b1/0x740 [rdma_cm] [T79] ? _raw_spin_unlock+0x2c/0x60 [T79] ? __pfx_rdma_listen+0x10/0x10 [rdma_cm] [T79] ? rdma_restrack_add+0x12c/0x630 [ib_core] [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] cma_listen_on_dev+0x46a/0x750 [rdma_cm] [T79] rdma_listen+0x4b0/0x740 [rdma_cm] [T79] ? __pfx_rdma_listen+0x10/0x10 [rdma_cm] [T79] ? cma_get_port+0x30d/0x7d0 [rdma_cm] [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm] [T79] ksmbd_rdma_init+0x12b/0x270 [ksmbd] [T79] ? __pfx_ksmbd_rdma_init+0x10/0x10 [ksmbd] [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? register_netdevice_notifier+0x1dc/0x240 [T79] ksmbd_conn_transport_init+0x26/0x70 [ksmbd] [T79] server_ctrl_handle_work+0x1e5/0x280 [ksmbd] [T79] process_one_work+0x86c/0x1930 [T79] ? __pfx_process_one_work+0x10/0x10 [T79] ? srso_alias_return_thunk+0x5/0xfbef5 [T79] ? assign_work+0x16f/0x280 [T79] worker_thread+0x6f0/0x11f0 I was not able to reproduce this as I was testing with various runs switching siw and rxe as well as IPPROTO_SMBDIRECT sockets, while the above stack used siw with the non IPPROTO_SMBDIRECT patches [1]. Even if this patch doesn't solve the above I think it's a good idea to reclassify the sockets used by siw, I also send patches for rxe to reclassify, as well as my IPPROTO_SMBDIRECT socket patches [1] will do it, this should minimize potential false positives. [1] https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect Cc: Bernard Metzler <bernard.metzler@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20251126150842.1837072-1-metze@samba.org Acked-by: Bernard Metzler <bernard.metzler@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06	IB/rdmavt: WQ_PERCPU added to alloc_workqueue users	Marco Crivellari
	Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-04	net: Convert proto_ops connect() callbacks to use sockaddr_unsized	Kees Cook
	Update all struct proto_ops connect() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04	net: Convert proto_ops bind() callbacks to use sockaddr_unsized	Kees Cook
	Update all struct proto_ops bind() callback function prototypes from "struct sockaddr " to "struct sockaddr_unsized " to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-28	RDMA/rxe: Fix null deref on srq->rq.queue after resize failure	Zhu Yanjun
	A NULL pointer dereference can occur in rxe_srq_chk_attr() when ibv_modify_srq() is invoked twice in succession under certain error conditions. The first call may fail in rxe_queue_resize(), which leads rxe_srq_from_attr() to set srq->rq.queue = NULL. The second call then triggers a crash (null deref) when accessing srq->rq.queue->buf->index_mask. Call Trace: <TASK> rxe_modify_srq+0x170/0x480 [rdma_rxe] ? __pfx_rxe_modify_srq+0x10/0x10 [rdma_rxe] ? uverbs_try_lock_object+0x4f/0xa0 [ib_uverbs] ? rdma_lookup_get_uobject+0x1f0/0x380 [ib_uverbs] ib_uverbs_modify_srq+0x204/0x290 [ib_uverbs] ? __pfx_ib_uverbs_modify_srq+0x10/0x10 [ib_uverbs] ? tryinc_node_nr_active+0xe6/0x150 ? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x2c0/0x470 [ib_uverbs] ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs] ib_uverbs_run_method+0x55a/0x6e0 [ib_uverbs] ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ib_uverbs_cmd_verbs+0x54d/0x800 [ib_uverbs] ? __pfx_ib_uverbs_cmd_verbs+0x10/0x10 [ib_uverbs] ? __pfx___raw_spin_lock_irqsave+0x10/0x10 ? __pfx_do_vfs_ioctl+0x10/0x10 ? ioctl_has_perm.constprop.0.isra.0+0x2c7/0x4c0 ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10 ib_uverbs_ioctl+0x13e/0x220 [ib_uverbs] ? __pfx_ib_uverbs_ioctl+0x10/0x10 [ib_uverbs] __x64_sys_ioctl+0x138/0x1c0 do_syscall_64+0x82/0x250 ? fdget_pos+0x58/0x4c0 ? ksys_write+0xf3/0x1c0 ? __pfx_ksys_write+0x10/0x10 ? do_syscall_64+0xc8/0x250 ? __pfx_vm_mmap_pgoff+0x10/0x10 ? fget+0x173/0x230 ? fput+0x2a/0x80 ? ksys_mmap_pgoff+0x224/0x4c0 ? do_syscall_64+0xc8/0x250 ? do_user_addr_fault+0x37b/0xfe0 ? clear_bhb_loop+0x50/0xa0 ? clear_bhb_loop+0x50/0xa0 ? clear_bhb_loop+0x50/0xa0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 8700e3e7c485 ("Soft RoCE driver") Tested-by: Liu Yi <asatsuyu.liu@gmail.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20251027215203.1321-1-yanjun.zhu@linux.dev Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19	RDMA/rxe: Remove redundant assignment to variable page_offset	Colin Ian King
	The variable page_offset is being assigned a value at the start of a loop and being redundantly zero'd at the end of the loop, there is no code that reads the zero'd value. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King <coking@nvidia.com> Link: https://patch.msgid.link/20251014120343.2528608-1-coking@nvidia.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-03	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma, and lots of small bnxt_re improvements. - Small bug fixes and improves to hfi1, efa, mlx5, erdma, rdmarvt, siw - Allow userspace access to IB service records through the rdmacm - Optimize dma mapping for erdma - Fix shutdown of the GSI QP in mana - Support relaxed ordering MR and fix a corruption bug with mlx5 DMA Data Direct - Many improvement to bnxt_re: - Debugging features and counters - Improve performance of some commands - Change flow_label reporting in completions - Mirror vnic - RDMA flow support - New RDMA driver for Pensando Ethernet devices: ionic - Gen 3 hardware support for the Intel irdma driver - Fix rdma routing resolution with VRFs" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits) RDMA/ionic: Fix memory leak of admin q_wr RDMA/siw: Always report immediate post SQ errors RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler RDMA/irdma: Remove unused struct irdma_cq fields RDMA/irdma: Fix positive vs negative error codes in irdma_post_send() RDMA/bnxt_re: Remove non-statistics counters from hw_counters RDMA/bnxt_re: Add debugfs info entry for device and resource information RDMA/bnxt_re: Fix incorrect errno used in function comments RDMA: Use %pe format specifier for error pointers RDMA/ionic: Use ether_addr_copy instead of memcpy RDMA/ionic: Fix build failure on SPARC due to xchg() operand size RDMA/rxe: Fix race in do_task() when draining IB/sa: Fix sa_local_svc_timeout_ms read race IB/ipoib: Ignore L3 master device RDMA/core: Use route entry flag to decide on loopback traffic RDMA/core: Resolve MAC of next-hop device without ARP support RDMA/core: Squash a single user static function RDMA/irdma: Update Kconfig RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices RDMA/irdma: Add Atomic Operations support ...
2025-09-26	RDMA/siw: Always report immediate post SQ errors	Bernard Metzler
	In siw_post_send(), any immediate error encountered during processing of the work request list must be reported to the caller, even if previous work requests in that list were just accepted and added to the send queue. Not reporting those errors confuses the caller, which would wait indefinitely for the failing and potentially subsequently aborted work requests completion. This fixes a case where immediate errors were overwritten by subsequent code in siw_post_send(). Fixes: 303ae1cdfdf7 ("rdma/siw: application interface") Link: https://patch.msgid.link/r/20250923144536.103825-1-bernard.metzler@linux.dev Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Bernard Metzler <bernard.metzler@linux.dev> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-09-21	RDMA/rxe: Fix race in do_task() when draining	Gui-Dong Han
	When do_task() exhausts its iteration budget (!ret), it sets the state to TASK_STATE_IDLE to reschedule, without a secondary check on the current task->state. This can overwrite the TASK_STATE_DRAINING state set by a concurrent call to rxe_cleanup_task() or rxe_disable_task(). While state changes are protected by a spinlock, both rxe_cleanup_task() and rxe_disable_task() release the lock while waiting for the task to finish draining in the while(!is_done(task)) loop. The race occurs if do_task() hits its iteration limit and acquires the lock in this window. The cleanup logic may then proceed while the task incorrectly reschedules itself, leading to a potential use-after-free. This bug was introduced during the migration from tasklets to workqueues, where the special handling for the draining case was lost. Fix this by restoring the original pre-migration behavior. If the state is TASK_STATE_DRAINING when iterations are exhausted, set cont to 1 to force a new loop iteration. This allows the task to finish its work, so that a subsequent iteration can reach the switch statement and correctly transition the state to TASK_STATE_DRAINED, stopping the task as intended. Fixes: 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks") Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com> Link: https://patch.msgid.link/20250919025212.1682087-1-hanguidong02@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11	RDMA/rdmavt: Use int type to store negative error codes	Qianfeng Rong
	Change 'ret' from u32 to int in alloc_qpn() to store -EINVAL, and remove the 'bail' label as it simply returns 'ret'. Storing negative error codes in an u32 causes no runtime issues, but it's ugly as pants, Change 'ret' from u32 to int type - this change has no runtime impact. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://patch.msgid.link/20250826150556.541440-1-rongqianfeng@vivo.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13	RDMA/rxe: Flush delayed SKBs while releasing RXE resources	Zhu Yanjun
	When skb packets are sent out, these skb packets still depends on the rxe resources, for example, QP, sk, when these packets are destroyed. If these rxe resources are released when the skb packets are destroyed, the call traces will appear. To avoid skb packets hang too long time in some network devices, a timestamp is added when these skb packets are created. If these skb packets hang too long time in network devices, these network devices can free these skb packets to release rxe resources. Reported-by: syzbot+8425ccfb599521edb153@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8425ccfb599521edb153 Tested-by: syzbot+8425ccfb599521edb153@syzkaller.appspotmail.com Fixes: 1a633bdc8fd9 ("RDMA/rxe: Let destroy qp succeed with stuck packet") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250726013104.463570-1-yanjun.zhu@linux.dev Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-07	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma fix from Jason Gunthorpe: "Single fix to correct the iov_iter construction in soft iwarp. This avoids blktest crashes with recent changes to the allocators" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/siw: Fix the sendmsg byte count in siw_tcp_sendpages
2025-08-05	RDMA/siw: Fix the sendmsg byte count in siw_tcp_sendpages	Pedro Falcato
	Ever since commit c2ff29e99a76 ("siw: Inline do_tcp_sendpages()"), we have been doing this: static int siw_tcp_sendpages(struct socket s, struct page page, int offset, size_t size) [...] / Calculate the number of bytes we need to push, for this page * specifically / size_t bytes = min_t(size_t, PAGE_SIZE - offset, size); / If we can't splice it, then copy it in, as normal / if (!sendpage_ok(page[i])) msg.msg_flags &= ~MSG_SPLICE_PAGES; / Set the bvec pointing to the page, with len $bytes / bvec_set_page(&bvec, page[i], bytes, offset); / Set the iter to $size, aka the size of the whole sendpages (!!!) / iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size); try_page_again: lock_sock(sk); / Sendmsg with $size size (!!!) / rv = tcp_sendmsg_locked(sk, &msg, size); This means we've been sending oversized iov_iters and tcp_sendmsg calls for a while. This has a been a benign bug because sendpage_ok() always returned true. With the recent slab allocator changes being slowly introduced into next (that disallow sendpage on large kmalloc allocations), we have recently hit out-of-bounds crashes, due to slight differences in iov_iter behavior between the MSG_SPLICE_PAGES and "regular" copy paths: (MSG_SPLICE_PAGES) skb_splice_from_iter iov_iter_extract_pages iov_iter_extract_bvec_pages uses i->nr_segs to correctly stop in its tracks before OoB'ing everywhere skb_splice_from_iter gets a "short" read (!MSG_SPLICE_PAGES) skb_copy_to_page_nocache copy=iov_iter_count [...] copy_from_iter / this doesn't help */ if (unlikely(iter->count < len)) len = iter->count; iterate_bvec ... and we run off the bvecs Fix this by properly setting the iov_iter's byte count, plus sending the correct byte count to tcp_sendmsg_locked. Link: https://patch.msgid.link/r/20250729120348.495568-1-pfalcato@suse.de Cc: stable@vger.kernel.org Fixes: c2ff29e99a76 ("siw: Inline do_tcp_sendpages()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202507220801.50a7210-lkp@intel.com Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Bernard Metzler <bernard.metzler@linux.dev> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-31	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: - Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1, rxe, erdma, mana_ib - Prefetch supprot for rxe ODP - Remove memory window support from hns as new device FW is no longer support it - Remove qib, it is very old and obsolete now, Cornelis wishes to restructure the hfi1/qib shared layer - Fix a race in destroying CQs where we can still end up with work running because the work is cancled before the driver stops triggering it - Improve interaction with namespaces: * Follow the devlink namespace for newly spawned RDMA devices * Create iopoib net devces in the parent IB device's namespace * Allow CAP_NET_RAW checks to pass in user namespaces - A new flow control scheme for IB MADs to try and avoid queue overflows in the network - Fix 2G message sizes in bnxt_re - Optimize mkey layout for mlx5 DMABUF - New "DMA Handle" concept to allow controlling PCI TPH and steering tags * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits) RDMA/siw: Change maintainer email address RDMA/mana_ib: add support of multiple ports RDMA/mlx5: Refactor optional counters steering code RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr IB: Extend UVERBS_METHOD_REG_MR to get DMAH RDMA/mlx5: Add DMAH object support RDMA/core: Introduce a DMAH object and its alloc/free APIs IB/core: Add UVERBS_METHOD_REG_MR on the MR object net/mlx5: Add support for device steering tag net/mlx5: Expose IFC bits for TPH PCI/TPH: Expose pcie_tph_get_st_table_size() RDMA/mlx5: Fix incorrect MKEY masking RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey() RDMA/mlx5: remove redundant check on err on return expression RDMA/mana_ib: add additional port counters RDMA/mana_ib: Fix DSCP value in modify QP RDMA/efa: Add CQ with external memory support RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers RDMA/uverbs: Add a common way to create CQ with umem RDMA/mlx5: Optimize DMABUF mkey page size ...
2025-07-23	IB: Extend UVERBS_METHOD_REG_MR to get DMAH	Yishai Hadas
	Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers. It will be used in mlx5 driver as part of the next patch from the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-18	net: s/dev_get_flags/netif_get_flags/	Stanislav Fomichev
	Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26	RDMA/core: Extend RDMA device registration to be net namespace aware	Mark Bloch
	Presently, RDMA devices are always registered within the init network namespace, even if the associated devlink device's namespace was changed via a devlink reload. This mismatch leads to discrepancies between the network namespace of the devlink device and that of the RDMA device. Therefore, extend the RDMA device allocation API to optionally take the net namespace. This isn't limited to devices that support devlink but allows all users to provide the network namespace if they need to do so. If a network namespace is provided during device allocation, it's up to the caller to make sure the namespace stays valid until ib_register_device() is called. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26	RDMA/rxe: Fix a couple IS_ERR() vs NULL bugs	Dan Carpenter
	The lookup_mr() function returns NULL on error. It never returns error pointers. Fixes: 9284bc34c773 ("RDMA/rxe: Enable asynchronous prefetch for ODP MRs") Fixes: 3576b0df1588 ("RDMA/rxe: Implement synchronous prefetch for ODP MRs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/685c1430.050a0220.18b0ef.da83@mx.google.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA/siw: work around clang stack size warning	Arnd Bergmann
	clang inlines a lot of functions into siw_qp_sq_process(), with the aggregate stack frame blowing the warning limit in some configurations: drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than] The real problem here is the array of kvec structures in siw_tx_hdt that makes up the majority of the consumed stack space. Ideally there would be a way to avoid allocating the array on the stack, but that would require a larger rework. Add a noinline_for_stack annotation to avoid the warning for now, and make clang behave the same way as gcc here. The combined stack usage is still similar, but is spread over multiple functions now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620114332.4072051-1-arnd@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12	RDMA/rxe: Remove redundant page presence check	Daisuke Matsuda
	hmm_pfn_to_page() does not return NULL. ib_umem_odp_map_dma_and_lock() should return an error in case the target pages cannot be mapped until timeout, so these checks can safely be removed. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250611162758.10000-1-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12	RDMA/rxe: Enable asynchronous prefetch for ODP MRs	Daisuke Matsuda
	Calling ibv_advise_mr(3) with flags other than IBV_ADVISE_MR_FLAG_FLUSH invokes an asynchronous request. It is best-effort, and thus can safely be deferred to the system-wide workqueue. The reference counter in rxe_mr is used to ensure that the MRs persist and that rxe is not terminated until the queued work is done. Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250522111955.3227-3-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12	RDMA/rxe: Implement synchronous prefetch for ODP MRs	Daisuke Matsuda
	Minimal implementation of ibv_advise_mr(3) requires synchronous calls being successful with the IBV_ADVISE_MR_FLAG_FLUSH flag. Asynchronous requests, which are best-effort, will be supported subsequently. Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250522111955.3227-2-dskmtsd@gmail.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-08	treewide, timers: Rename from_timer() to timer_container_of()	Ingo Molnar
	Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-30	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: "Usual collection of driver fixes: - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw - Further ODP functionality in rxe - Remote access MRs in mana, along with more page sizes - Improve CM scalability with a rwlock around the agent - More trace points for hns - ODP hmm conversion to the new two step dma API - Support the ethernet HW device in mana as well as the RNIC - Cleanups: - Use secs_to_jiffies() when appropriate - Use ERR_CAST() instead of naked casts - Don't use %pK in printk - Unusued functions removed - Allocation type matching" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits) RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work RDMA/bnxt_re: Support extended stats for Thor2 VF RDMA/hns: Fix endian issue in trace events RDMA/mlx5: Avoid flexible array warning IB/cm: Remove dead code and adjust naming RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices RDMA/rxe: Break endless pagefault loop for RO pages RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc RDMA/bnxt_re: Fix missing error handling for tx_queue RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output RDMA/mlx5: Add support for 200Gbps per lane speeds RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction net: mana: Add support for auxiliary device servicing events RDMA/mana_ib: unify mana_ib functions to support any gdma device RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic net: mana: Probe rdma device in mana driver RDMA/siw: replace redundant ternary operator with just rv RDMA/umem: Separate implicit ODP initialization from explicit ODP RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage ...
2025-05-26	Merge tag 'v6.15' into rdma.git for-next	Jason Gunthorpe
	Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-22	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22	RDMA/rxe: Break endless pagefault loop for RO pages	Leon Romanovsky
	RO pages has "perm" equal to 0, that caused to the situation where such pages were marked as needed to have fault and caused to infinite loop. Fixes: eedd5b1276e7 ("RDMA/umem: Store ODP access mask information in PFN") Reported-by: Daisuke Matsuda <dskmtsd@gmail.com> Closes: https://lore.kernel.org/all/3016329a-4edd-4550-862f-b298a1b79a39@gmail.com/ Link: https://patch.msgid.link/096fab178d48ed86942ee22eafe9be98e29092aa.1747913377.git.leonro@nvidia.com Tested-by: Daisuke Matsuda <dskmtsd@gmail.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-21	RDMA/siw: use skb_crc32c() instead of __skb_checksum()	Eric Biggers
	Instead of calling __skb_checksum() with a skb_checksum_ops struct that does CRC32C, just call the new function skb_crc32c(). This is faster and simpler. Acked-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12	RDMA/siw: replace redundant ternary operator with just rv	Colin Ian King
	The use of the ternary operator on rv is redundant, rv is either the initialized value of 0 or a negative error return code, so it can never be greater than zero, and hence the zero assignment in ternary operator is redundant. Just return rv instead. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://patch.msgid.link/20250507131834.253823-1-colin.i.king@gmail.com Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12	RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage	Leon Romanovsky
	Reuse newly added DMA API to cache IOVA and only link/unlink pages in fast path for UMEM ODP flow. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12	RDMA/umem: Store ODP access mask information in PFN	Leon Romanovsky
	As a preparation to remove dma_list, store access mask in PFN pointer and not in dma_addr_t. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-06	RDMA/siw: Remove unused siw_mem_add	Dr. David Alan Gilbert
	siw_mem_add() was added in 2019 by commit 2251334dcac9 ("rdma/siw: application buffer management") but has remained unused. Remove it. Link: https://patch.msgid.link/r/20250505210226.88994-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-21	RDMA/rxe: Remove 32-bit architecture support	Daisuke Matsuda
	Major linux distibutions have phased out support for 32-bit machines. Since rxe is primarily used for development and testing, the benefit of maintaining 32-bit support is minimal. This change simplifies ATOMIC WRITE implementations and improves maintainability of the driver. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Link: https://patch.msgid.link/20250421025101.3588139-1-matsuda-daisuke@fujitsu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20	RDMA/rxe: Remove unused rxe_run_task	Dr. David Alan Gilbert
	rxe_run_task() has been unused since 2024's commit 23bc06af547f ("RDMA/rxe: Don't call direct between tasks") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250419132725.199785-1-linux@treblig.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20	RDMA/rxe: Fix "trying to register non-static key in rxe_qp_do_cleanup" bug	Zhu Yanjun
	Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 assign_lock_key kernel/locking/lockdep.c:986 [inline] register_lock_class+0x4a3/0x4c0 kernel/locking/lockdep.c:1300 __lock_acquire+0x99/0x1ba0 kernel/locking/lockdep.c:5110 lock_acquire kernel/locking/lockdep.c:5866 [inline] lock_acquire+0x179/0x350 kernel/locking/lockdep.c:5823 __timer_delete_sync+0x152/0x1b0 kernel/time/timer.c:1644 rxe_qp_do_cleanup+0x5c3/0x7e0 drivers/infiniband/sw/rxe/rxe_qp.c:815 execute_in_process_context+0x3a/0x160 kernel/workqueue.c:4596 __rxe_cleanup+0x267/0x3c0 drivers/infiniband/sw/rxe/rxe_pool.c:232 rxe_create_qp+0x3f7/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:604 create_qp+0x62d/0xa80 drivers/infiniband/core/verbs.c:1250 ib_create_qp_kernel+0x9f/0x310 drivers/infiniband/core/verbs.c:1361 ib_create_qp include/rdma/ib_verbs.h:3803 [inline] rdma_create_qp+0x10c/0x340 drivers/infiniband/core/cma.c:1144 rds_ib_setup_qp+0xc86/0x19a0 net/rds/ib_cm.c:600 rds_ib_cm_initiate_connect+0x1e8/0x3d0 net/rds/ib_cm.c:944 rds_rdma_cm_event_handler_cmn+0x61f/0x8c0 net/rds/rdma_transport.c:109 cma_cm_event_handler+0x94/0x300 drivers/infiniband/core/cma.c:2184 cma_work_handler+0x15b/0x230 drivers/infiniband/core/cma.c:3042 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> The root cause is as below: In the function rxe_create_qp, the function rxe_qp_from_init is called to create qp, if this function rxe_qp_from_init fails, rxe_cleanup will be called to handle all the allocated resources, including the timers: retrans_timer and rnr_nak_timer. The function rxe_qp_from_init calls the function rxe_qp_init_req to initialize the timers: retrans_timer and rnr_nak_timer. But these timers are initialized in the end of rxe_qp_init_req. If some errors occur before the initialization of these timers, this problem will occur. The solution is to check whether these timers are initialized or not. If these timers are not initialized, ignore these timers. Fixes: 8700e3e7c485 ("Soft RoCE driver") Reported-by: syzbot+4edb496c3cad6e953a31@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=4edb496c3cad6e953a31 Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250419080741.1515231-1-yanjun.zhu@linux.dev Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20	RDMA/rxe: Fix slab-use-after-free Read in rxe_queue_cleanup bug	Zhu Yanjun
	Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x7d/0xa0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xcf/0x610 mm/kasan/report.c:489 kasan_report+0xb5/0xe0 mm/kasan/report.c:602 rxe_queue_cleanup+0xd0/0xe0 drivers/infiniband/sw/rxe/rxe_queue.c:195 rxe_cq_cleanup+0x3f/0x50 drivers/infiniband/sw/rxe/rxe_cq.c:132 __rxe_cleanup+0x168/0x300 drivers/infiniband/sw/rxe/rxe_pool.c:232 rxe_create_cq+0x22e/0x3a0 drivers/infiniband/sw/rxe/rxe_verbs.c:1109 create_cq+0x658/0xb90 drivers/infiniband/core/uverbs_cmd.c:1052 ib_uverbs_create_cq+0xc7/0x120 drivers/infiniband/core/uverbs_cmd.c:1095 ib_uverbs_write+0x969/0xc90 drivers/infiniband/core/uverbs_main.c:679 vfs_write fs/read_write.c:677 [inline] vfs_write+0x26a/0xcc0 fs/read_write.c:659 ksys_write+0x1b8/0x200 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f In the function rxe_create_cq, when rxe_cq_from_init fails, the function rxe_cleanup will be called to handle the allocated resources. In fact, some memory resources have already been freed in the function rxe_cq_from_init. Thus, this problem will occur. The solution is to let rxe_cleanup do all the work. Fixes: 8700e3e7c485 ("Soft RoCE driver") Link: https://paste.ubuntu.com/p/tJgC42wDf6/ Tested-by: liuyi <liuy22@mails.tsinghua.edu.cn> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250412075714.3257358-1-yanjun.zhu@linux.dev Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-11	RDMA/rxe: Fix mismatched type declarations	Daisuke Matsuda
	Some functions return int values while they are defined as enum resp_states variables. This patch resolves the mismatches in rxe. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Link: https://patch.msgid.link/20250409102701.1275265-1-matsuda-daisuke@fujitsu.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-09	RDMA: Don't use %pK through printk	Thomas Weißschuh
	In the past %pK was preferable to %p as it would not leak raw pointer values into the kernel log. Since commit ad67b74d2469 ("printk: hash addresses printed with %p") the regular %p has been improved to avoid this issue. Furthermore, restricted pointers ("%pK") were never meant to be used through printk(). They can still unintentionally leak raw pointers or acquire sleeping looks in atomic contexts. Switch to the regular pointer formatting which is safer and easier to reason about. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20250407-restricted-pointers-infiniband-v1-1-22b20504b84d@linutronix.de Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-09	RDMA/rxe: Enable ODP in ATOMIC WRITE operation	Daisuke Matsuda
	Add rxe_odp_do_atomic_write() so that ODP specific steps are applied to ATOMIC WRITE requests. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Link: https://patch.msgid.link/20250324075649.3313968-3-matsuda-daisuke@fujitsu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-08	RDMA/rxe: Enable ODP in RDMA FLUSH operation	Daisuke Matsuda
	For persistent memories, add rxe_odp_flush_pmem_iova() so that ODP specific steps are executed. Otherwise, no additional consideration is required. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Link: https://patch.msgid.link/20250324075649.3313968-2-matsuda-daisuke@fujitsu.com Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-07	RDMA/rxe: Fix null pointer dereference in ODP MR check	Li Zhijian
	The blktests/rnbd reported a null pointer dereference as following. Similar to the mlx5, introduce a is_odp_mr() to check if the odp is enabled in this mr. Workqueue: rxe_wq do_work [rdma_rxe] RIP: 0010:rxe_mr_copy+0x57/0x210 [rdma_rxe] Code: 7c 04 48 89 f3 48 89 d5 41 89 cf 45 89 c4 0f 84 dc 00 00 00 89 ca e8 f8 f8 ff ff 85 c0 0f 85 75 01 00 00 49 8b 86 f0 00 00 00 <f6> 40 28 02 0f 85 98 01 00 00 41 8b 46 78 41 8b 8e 10 01 00 00 8d RSP: 0018:ffffa0aac02cfcf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff9079cd440024 RCX: 0000000000000000 RDX: 000000000000003c RSI: ffff9079cd440060 RDI: ffff9079cd665600 RBP: ffff9079c0e5e45a R08: 0000000000000000 R09: 0000000000000000 R10: 000000003c000000 R11: 0000000000225510 R12: 0000000000000000 R13: 0000000000000000 R14: ffff9079cd665600 R15: 000000000000003c FS: 0000000000000000(0000) GS:ffff907ccfa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 0000000119498001 CR4: 00000000001726f0 Call Trace: <TASK> ? __die_body+0x1e/0x60 ? page_fault_oops+0x14f/0x4c0 ? rxe_mr_copy+0x57/0x210 [rdma_rxe] ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? rxe_mr_copy+0x57/0x210 [rdma_rxe] ? rxe_mr_copy+0x48/0x210 [rdma_rxe] ? rxe_pool_get_index+0x50/0x90 [rdma_rxe] rxe_receiver+0x1d98/0x2530 [rdma_rxe] ? psi_task_switch+0x1ff/0x250 ? finish_task_switch+0x92/0x2d0 ? __schedule+0xbdf/0x17c0 do_task+0x65/0x1e0 [rdma_rxe] process_scheduled_works+0xaa/0x3f0 worker_thread+0x117/0x240 Fixes: d03fb5c6599e ("RDMA/rxe: Allow registering MRs for On-Demand Paging") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/r/20250402032657.1762800-1-lizhijian@fujitsu.com Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-05	treewide: Switch/rename to timer_delete[_sync]()	Thomas Gleixner
	timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-29	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: - Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser, mlx5, vmw_pvrdma, hns - Make rxe work on tun devices - mana gains more standard verbs as it moves toward supporting in-kernel verbs - DMABUF support for mana - Fix page size calculations when memory registration exceeds 4G - On Demand Paging support for rxe - mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism to access control use of them - Optional RDMA_TX/RX counters per QP in mlx5 * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits) IB/mad: Check available slots before posting receive WRs RDMA/mana_ib: Fix integer overflow during queue creation RDMA/mlx5: Fix calculation of total invalidated pages RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow RDMA/mlx5: Fix page_size variable overflow RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc() RDMA/mlx5: Fix cache entry update on dereg error RDMA/mlx5: Fix MR cache initialization error flow RDMA/mlx5: Support optional-counters binding for QPs RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config RDMA/core: Pass port to counter bind/unbind operations RDMA/core: Add support to optional-counters binding configuration RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj() RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes RDMA/core: Fix use-after-free when rename device name RDMA/bnxt_re: Support perf management counters RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op() RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject() RDMA/mana_ib: Handle net event for pointing to the current netdev net: mana: Change the function signature of mana_get_primary_netdev_rcu ...
2025-03-25	Merge tag 'crc-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect - Add kunit test cases for crc64_nvme and crc7 - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c() - Remove unnecessary prompts from some of the CRC kconfig options - Further optimize the x86 crc32c code" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits) x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512 lib/crc: remove unnecessary prompt for CONFIG_CRC64 lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C lib/crc: remove unnecessary prompt for CONFIG_CRC8 lib/crc: remove unnecessary prompt for CONFIG_CRC7 lib/crc: remove unnecessary prompt for CONFIG_CRC4 lib/crc7: unexport crc7_be_syndrome_table lib/crc_kunit.c: update comment in crc_benchmark() lib/crc_kunit.c: add test and benchmark for crc7_be() x86/crc32: optimize tail handling for crc32c short inputs riscv/crc64: add Zbc optimized CRC64 functions riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function riscv/crc32: reimplement the CRC32 functions using new template riscv/crc: add "template" for Zbc optimized CRC functions x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings x86/crc32: improve crc32c_arch() code generation with clang x86/crc64: implement crc64_be and crc64_nvme using new template x86/crc-t10dif: implement crc_t10dif using new template x86/crc32: implement crc32_le using new template x86/crc: add "template" for [V]PCLMULQDQ based CRC functions ...
2025-03-25	Merge tag 'timers-cleanups-2025-03-23' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-13	RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op()	Daisuke Matsuda
	rxe_mr_do_atomic_op() returns enum resp_states numbers, so the ODP counterpart must not return raw errno codes. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250313064540.2619115-1-matsuda-daisuke@fujitsu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>