linux-toradex.git/drivers/gpu/drm/xe, branch master

drm/xe: fix job timeout recovery for unstarted jobs and kernel queues

2026-06-11T13:39:43+00:00

A job that GuC never scheduled (never started) indicates a GuC
scheduling failure; previously such jobs were silently errored out
instead of triggering a GT reset to recover. Trigger a GT reset and
resubmit them, but only when the queue was not already killed or banned:
an unstarted job on an already banned queue is the ban working as
intended and must neither clear the ban nor kick off a reset, otherwise
a banned userspace queue could be resurrected and spam GT resets.

Kernel queues are always recovered this way and wedge the device once
recovery attempts are exhausted, since kernel work must not silently
fail. A started job that times out on a userspace VM bind queue stays
banned rather than being reset and retried.

The queue is banned early in the timeout handler to signal the G2H
scheduling-done handler so it wakes the disable-scheduling waiter;
without it the waiter sleeps the full 5s timeout. When a reset is
warranted the ban is cleared before rearming so that
guc_exec_queue_start() can resubmit jobs after the GT reset - a
still-banned queue would block resubmission and cause an infinite TDR
loop. The already-banned case is gated out before this point via
skip_timeout_check, so it is unaffected.

v2: (Himal) Do it for any queue type, not just kernel/migration
v3: - (Sashiko and Sanjay): don't clear the ban / GT reset for already
      killed/banned queues on unstarted-job timeout
    - Update commit message
    - (Matt) Add Fixes tag

Fixes: fe05cee4d953 ("drm/xe: Don't short circuit TDR on jobs not started")
Cc: Matthew Auld 
Cc: Matthew Brost 
Cc: Sanjay Yadav 
Cc: Himal Prasad Ghimiray 
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Assisted-by: GitHub-Copilot:claude-opus-4.8
Tested-by: Sanjay Yadav 
Reviewed-by: Sanjay Yadav 
Reviewed-by: Matthew Brost 
Reviewed-by: Himal Prasad Ghimiray 
Link: https://patch.msgid.link/20260610152548.404575-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi 
(cherry picked from commit b1107d085e7e8ed15ba6f80c102528a9c8a6cb0e)
Signed-off-by: Matthew Brost

drm/xe: fix refcount leak in xe_range_fence_insert()

2026-06-11T13:39:40+00:00

xe_range_fence_insert() acquires a reference on fence via
dma_fence_get() and stores it in rfence->fence.  It then calls
dma_fence_add_callback() and handles two cases: when the callback
is successfully registered (err == 0) the fence is transferred to
the tree for later cleanup; when the fence is already signaled
(err == -ENOENT) it manually drops the extra reference with
dma_fence_put(fence).

However, dma_fence_add_callback() can fail with other errors
(e.g. -EINVAL) and in that case the code falls through to the free:
label without releasing the acquired reference, leaking it.

Fix the leak by adding an else branch that calls dma_fence_put()
before jumping to free: for any error other than -ENOENT.

Fixes: 845f64bdbfc9 ("drm/xe: Introduce a range-fence utility")
Signed-off-by: Wentao Liang 
Reviewed-by: Matthew Brost 
Signed-off-by: Matthew Brost 
Link: https://patch.msgid.link/20260610172705.3450560-1-matthew.brost@intel.com
(cherry picked from commit 98c4a4201290823c2c5c7ba21692bd9a64b61021)
Signed-off-by: Matthew Brost

drm/xe: include all registered queues in TLB invalidation

2026-06-10T16:33:29+00:00

Context-based TLB invalidation currently selects only scheduling-active
exec queues via q->ops->active(). During rebind flows, queues may be
suspended (or transitioning through resume) while still owning valid
translations, causing them to be skipped from invalidation and leading
to missed TLB invalidations on LR rebinds.

The underlying issue is a TOCTOU: q->guc->state bits are flipped lock-free
from enable_scheduling(), disable_scheduling{,_deregister}(), the
suspend/resume sched-msg handlers, handle_sched_done(), and
guc_exec_queue_stop(); nothing in send_tlb_inval_ctx_ppgtt() serializes
against them, so any state-based predicate can race.

Include all the registered queues so that TLB invalidations are not
missed. This is race-free because list membership on vm->exec_queues.list
is stable under vm->exec_queues.lock held by the caller. The performance
impact is expected to be minimal and harmless. If it does turn out to be
a concern, we can come back with a race-safe solution to ignore certain
queues.

Fixes: 6cdaa5346d6f ("drm/xe: Add context-based invalidation to GuC TLB invalidation backend")
Assisted-by: Claude:claude-opus-4.6
Suggested-by: Thomas Hellstrom 
Signed-off-by: Tangudu Tilak Tirumalesh 
Reviewed-by: Thomas Hellström 
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260608162745.338725-2-tilak.tirumalesh.tangudu@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit aa625e1e9f0710e424fe4f0e3f032807df81b5b0)
Signed-off-by: Matthew Brost

drm/xe/hw_error: Use HW_ERR prefix in log

2026-06-10T16:33:25+00:00

Hardware errors should be logged with HW_ERR prefix. Make them
consistent with existing logs.

Fixes: 01aab7e1c9d4 ("drm/xe/xe_hw_error: Add support for PVC SoC errors")
Signed-off-by: Raag Jadav 
Reviewed-by: Riana Tauro 
Link: https://patch.msgid.link/20260602044919.702209-5-raag.jadav@intel.com
Signed-off-by: Matt Roper 
(cherry picked from commit ad60a618c49fef07d1860bfb1091140d29f5eddb)
Signed-off-by: Matthew Brost

drm/xe/drm_ras: Add per node cleanup action

2026-06-10T16:33:22+00:00

cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Add per node cleanup action which guarantees
cleanup on unwind and also simplifies the cleanup logic.

Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav 
Reviewed-by: Riana Tauro 
Link: https://patch.msgid.link/20260602044919.702209-4-raag.jadav@intel.com
Signed-off-by: Matt Roper 
(cherry picked from commit 67fc5543d8274b2fcbef87734fad0469358f4478)
Signed-off-by: Matthew Brost

drm/xe/drm_ras: Make counter allocation drm managed

2026-06-10T16:33:18+00:00

cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Fix this using drm managed allocation, which is
guaranteed to be cleaned up on unwind.

Fixes: b40db12b542f ("drm/xe/xe_drm_ras: Add support for XE DRM RAS")
Signed-off-by: Raag Jadav 
Reviewed-by: Riana Tauro 
Link: https://patch.msgid.link/20260602044919.702209-3-raag.jadav@intel.com
Signed-off-by: Matt Roper 
(cherry picked from commit 58d77c77ea0c5cb2b755ebe23e973c8272acd896)
Signed-off-by: Matthew Brost

drm/xe/display: fix oops in suspend/shutdown without display

2026-06-10T16:33:09+00:00

The xe driver keeps track of whether to probe display, and whether
display hardware is there, using xe->info.probe_display. It gets set to
false if there's no display after intel_display_device_probe(). However,
the display may also be disabled via fuses, detected at a later time in
intel_display_device_info_runtime_init().

In this case, the xe driver does for_each_intel_crtc() on uninitialized
mode config in xe_display_flush_cleanup_work(), leading to a NULL
pointer dereference, and generally calls display code with display info
cleared.

Check for intel_display_device_present() after
intel_display_device_info_runtime_init(), and reset
xe->info.probe_display as necessary. Also do unset_display_features()
for completeness, although display runtime init has already done
that. This will need to be unified across all cases later.

Move intel_display_device_info_runtime_init() call slightly earlier,
similar to i915, to avoid a bunch of unnecessary setup for no display
cases.

Note #1: The xe driver has no business doing low level display plumbing
like for_each_intel_crtc() to begin with. It all needs to happen in
display code.

Note #2: The actual bug is present already in commit 44e694958b95
("drm/xe/display: Implement display support"), but the oops was likely
introduced later at commit ddf6492e0e50 ("drm/xe/display: Make display
suspend/resume work on discrete").

Fixes: 44e694958b95 ("drm/xe/display: Implement display support")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7904
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/6150
Cc: stable@vger.kernel.org # v6.8+
Reviewed-by: Suraj Kandpal 
Link: https://patch.msgid.link/20260515160920.1082842-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula 
(cherry picked from commit 7c3eb9f47533220888a67266448185fd0775d4da)
Signed-off-by: Matthew Brost

drm/xe/multi_queue: skip submit when primary queue is suspended

2026-06-04T13:04:09+00:00

Return early in submit path when the multi-queue primary exec
queue is suspended to avoid submitting while suspended.

v2: Remove idle_skip_suspend fix as that feature is being
reverted here https://patchwork.freedesktop.org/series/167262/

Fixes: bc5775c59258 ("drm/xe/multi_queue: Add GuC interface for multi queue support")
Cc: stable@vger.kernel.org # v7.0+
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Reviewed-by: Daniele Ceraolo Spurio 
Signed-off-by: Niranjana Vishwanathapura 
Link: https://patch.msgid.link/20260603233946.863663-2-niranjana.vishwanathapura@intel.com
(cherry picked from commit b7fb55cc3364ca128cfff9d50649ffd4327cd01e)
Signed-off-by: Rodrigo Vivi

drm/xe: Clear pending_disable before signaling suspend fence

2026-06-04T13:04:03+00:00

In the schedule-disable done path for suspend, we
signal the suspend fence before clearing pending_disable.

That wakeup can let suspend_wait complete and resume be queued
immediately. The resume path may then reach enable_scheduling()
while pending_disable is still set and hit the
!exec_queue_pending_disable(q) assertion.

Fix this by clearing pending_disable before signaling
the suspend fence, so any resumed transition observes a
consistent state.

Fixes: 87651f31ae4e ("drm/xe/guc_submit: fix race around suspend_pending")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Tangudu Tilak Tirumalesh 
Reviewed-by: Thomas Hellstrom 
Signed-off-by: Daniele Ceraolo Spurio 
Link: https://patch.msgid.link/20260603065217.3131066-3-tilak.tirumalesh.tangudu@intel.com
(cherry picked from commit 4b1ae138b0e103d753773956a84eebc2edbf62c4)
Signed-off-by: Rodrigo Vivi

Revert "drm/xe: Skip exec queue schedule toggle if queue is idle during suspend"

2026-06-04T13:03:57+00:00

This reverts commit 8533051ce92015e9cc6f75e0d52119b9d91610b6.

The idle-skip optimization bypasses GuC suspend, so the GPU may not
perform the context switch that flushes TLB entries for invalidated
userptr VMAs. In LR/preempt-fence VM mode, this can lead to missed TLB
invalidation and page faults during userptr invalidation tests.

Restore unconditional schedule toggling on suspend so the context-switch
TLB flush is always performed.

This optimization will be reintroduced with a fix that does not skip
suspend in LR/preempt-fence VM mode.

Fixes: 8533051ce920 ("drm/xe: Skip exec queue schedule toggle if queue is idle during suspend")
Cc: stable@vger.kernel.org # v7.0+
Suggested-by: Thomas Hellstrom 
Signed-off-by: Tangudu Tilak Tirumalesh 
Reviewed-by: Thomas Hellstrom 
Signed-off-by: Daniele Ceraolo Spurio 
Link: https://patch.msgid.link/20260603065217.3131066-2-tilak.tirumalesh.tangudu@intel.com
(cherry picked from commit 6a1e7934d9a6cf46aecae00a99c2603d1295e170)
Signed-off-by: Rodrigo Vivi