drm/amdgpu: Implement user queue reset functionality

This patch adds robust reset handling for user queues (userq) to improve recovery from queue failures. The key components include: 1. Queue detection and reset logic: - amdgpu_userq_detect_and_reset_queues() identifies failed queues - Per-IP detect_and_reset callbacks for targeted recovery - Falls back to full GPU reset when needed 2. Reset infrastructure: - Adds userq_reset_work workqueue for async reset handling - Implements pre/post reset handlers for queue state management - Integrates with existing GPU reset framework 3. Error handling improvements: - Enhanced state tracking with HUNG state - Automatic reset triggering on critical failures - VRAM loss handling during recovery 4. Integration points: - Added to device init/reset paths - Called during queue destroy, suspend, and isolation events - Handles both individual queue and full GPU resets The reset functionality works with both gfx/compute and sdma queues, providing better resilience against queue failures while minimizing disruption to unaffected queues. v2: add detection and reset calls when preemption/unmaped fails. add a per device userq counter for each user queue type.(Alex) v3: make sure we hold the adev->userq_mutex when we call amdgpu_userq_detect_and_reset_queues. (Alex) warn if the adev->userq_mutex is not held. v4: make sure we have all of the uqm->userq_mutex held. warn if the uqm->userq_mutex is not held. v5: Use array for user queue type counters.(Alex) all of the uqm->userq_mutex need to be held when calling detect and reset. (Alex) v6: fix lock dep warning in amdgpu_userq_fence_dence_driver_process v7: add the queue types in an array and use a loop in amdgpu_userq_detect_and_reset_queues (Lijo) v8: remove atomic_set(&userq_mgr->userq_count[i], 0). it should already be 0 since we kzalloc the structure (Alex) v9: For consistency with kernel queues, We may want something like: amdgpu_userq_is_reset_type_supported (Alex) Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
author: Jesse.Zhang <Jesse.Zhang@amd.com> 2025-10-24 10:51:52 +0800
committer: Alex Deucher <alexander.deucher@amd.com> 2025-11-04 11:53:05 -0500
commit: 290f46cf5726509f55fac4c7abe9b9d311aa5a3a (patch)
tree: 9e98961c9984aeed25d480774efd672b3f233516 /drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
parent: 825df7ff4bb1a383ad4827545e09aec60d230770 (diff)
1 files changed, 5 insertions, 0 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
index 09da0617bfa2..c37444427a14 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
@@ -106,6 +106,7 @@ struct amdgpu_userq_mgr {
 	struct amdgpu_device		*adev;
 	struct delayed_work		resume_work;
 	struct drm_file			*file;
+	atomic_t                        userq_count[AMDGPU_RING_TYPE_MAX];
 };
 
 struct amdgpu_db_info {
@@ -148,6 +149,10 @@ int amdgpu_userq_stop_sched_for_enforce_isolation(struct amdgpu_device *adev,
 						  u32 idx);
 int amdgpu_userq_start_sched_for_enforce_isolation(struct amdgpu_device *adev,
 						   u32 idx);
+void amdgpu_userq_reset_work(struct work_struct *work);
+void amdgpu_userq_pre_reset(struct amdgpu_device *adev);
+int amdgpu_userq_post_reset(struct amdgpu_device *adev, bool vram_lost);
+
 int amdgpu_userq_input_va_validate(struct amdgpu_usermode_queue *queue,
 				   u64 addr, u64 expected_size);
 int amdgpu_userq_gem_va_unmap_validate(struct amdgpu_device *adev,
author	Jesse.Zhang <Jesse.Zhang@amd.com>	2025-10-24 10:51:52 +0800
committer	Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:05 -0500
commit	290f46cf5726509f55fac4c7abe9b9d311aa5a3a (patch)
tree	9e98961c9984aeed25d480774efd672b3f233516 /drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
parent	825df7ff4bb1a383ad4827545e09aec60d230770 (diff)