linux-toradex.git/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c, branch v6.16-rc5

drm/amdgpu: Implement unrecoverable error message handling for VFs

2025-05-07T21:43:13+00:00

This notification may arrive in the VF mailbox while the VF is polling
for a response to another event.

This patch covers the following scenarios:

- If VF is already in RMA state, then do not attempt to contact the host.
  Host will ignore the VF after sending the notification.

- If the notification is detected during polling, then set the RMA status,
  and return error to caller.

- If the notification arrives by interrupt, then set the RMA status and
  queue a reset.  This reset will fail and VF will stop runtime services.
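
The three scenarios above can be sketched as a small user-space model.
This is only an illustration under stated assumptions: the event enum,
struct vf_state, the function names and the chosen errno values are all
hypothetical and are not the driver's real definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mailbox events; the real driver uses its own IDH_* IDs. */
enum mb_event { MB_EVENT_NONE, MB_EVENT_ACK, MB_EVENT_UNRECOVERABLE };

/* Hypothetical per-VF state for this sketch. */
struct vf_state {
    bool rma;           /* set once the host reports an unrecoverable error */
    bool reset_queued;  /* a reset has been queued from the interrupt path */
    int host_requests;  /* number of requests actually sent to the host */
};

/* Scenario 1: once in RMA state, never contact the host again. */
int vf_send_request(struct vf_state *vf)
{
    assert(vf);
    if (vf->rma)
        return -ENODEV;
    vf->host_requests++;
    return 0;
}

/* Scenario 2: notification seen while polling for another event. */
int vf_poll_event(struct vf_state *vf, enum mb_event seen,
                  enum mb_event wanted)
{
    assert(vf);
    if (seen == MB_EVENT_UNRECOVERABLE) {
        vf->rma = true;     /* record RMA status */
        return -EIO;        /* and report an error to the caller */
    }
    return seen == wanted ? 0 : -ETIMEDOUT;
}

/* Scenario 3: notification delivered by interrupt. */
void vf_irq_event(struct vf_state *vf, enum mb_event seen)
{
    assert(vf);
    if (seen == MB_EVENT_UNRECOVERABLE) {
        vf->rma = true;
        vf->reset_queued = true;  /* the queued reset is expected to fail */
    }
}
```

In this model, a call to vf_send_request() after either path has set
the rma field returns an error without incrementing host_requests.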

Reviewed-by: Shravan Kumar Gande 
Signed-off-by: Victor Skvortsov 
Signed-off-by: Ellen Pan 
Signed-off-by: Alex Deucher

drm/amdgpu: Implement Runtime Bad Page query for VFs

2025-05-07T21:41:49+00:00

Host will send a notification when new bad pages are available.

Upon guest request, the first 256 bad page addresses
will be placed into the PF2VF region.
Guest should pause the PF2VF worker thread while
the copy is in progress.
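
A minimal sketch of the guest-side copy, assuming a simplified PF2VF
layout. Only the 256-entry limit comes from the message above; the
struct layouts, field names and the worker_paused flag are hypothetical
stand-ins for the real PF2VF region and worker thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BAD_PAGE_MAX 256  /* limit stated in the commit message */

/* Hypothetical view of the part of the PF2VF region that holds bad pages. */
struct pf2vf_region {
    uint32_t bad_page_count;            /* as reported by the host */
    uint64_t bad_pages[BAD_PAGE_MAX];
};

/* Hypothetical guest-side state. */
struct guest {
    bool worker_paused;                 /* models the paused PF2VF worker */
    uint32_t bad_page_count;
    uint64_t bad_pages[BAD_PAGE_MAX];
};

/* Copy at most BAD_PAGE_MAX addresses while the worker is paused. */
size_t guest_fetch_bad_pages(struct guest *g, const struct pf2vf_region *r)
{
    assert(g && r);

    uint32_t n = r->bad_page_count;
    if (n > BAD_PAGE_MAX)
        n = BAD_PAGE_MAX;               /* never read past the region */

    g->worker_paused = true;            /* pause before touching the region */
    memcpy(g->bad_pages, r->bad_pages, n * sizeof(r->bad_pages[0]));
    g->bad_page_count = n;
    g->worker_paused = false;           /* resume once the copy is done */

    return n;
}
```

Clamping the count before the copy keeps the sketch safe even if the
reported count exceeds the number of slots in the region.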

Reviewed-by: Shravan Kumar Gande 
Signed-off-by: Victor Skvortsov 
Signed-off-by: Ellen Pan 
Signed-off-by: Alex Deucher

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37+00:00

For the RAS error scenario, the VF guest driver will check the mailbox
and set the fed flag to avoid unnecessary HW accesses.
Additionally, poll for the reset completion message first
to avoid accidentally spamming multiple reset requests to the host.

v2: add another mailbox check to handle the case where kfd detects
the timeout first

v3: set host_flr bit and use wait_for_reset
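
The ordering described above (set the fed flag, then look for a reset
completion message before asking for a reset) might be modelled like
this. It is a sketch only: struct mailbox, struct vf and
vf_handle_timeout() are hypothetical names, and the real driver reads
these messages from hardware mailbox registers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the guest finds in its mailbox. */
struct mailbox {
    bool ras_fatal_error;   /* host reported a RAS fatal error */
    bool reset_complete;    /* host reported that a reset has finished */
};

/* Hypothetical guest-side state. */
struct vf {
    bool fed;               /* fatal error detected: avoid further HW access */
    int reset_requests;     /* reset requests actually sent to the host */
};

/* Returns true if a new reset request was sent to the host. */
bool vf_handle_timeout(struct vf *vf, const struct mailbox *mb)
{
    assert(vf && mb);

    if (mb->ras_fatal_error)
        vf->fed = true;

    /* Check for completion first: the host may already have reset us. */
    if (mb->reset_complete)
        return false;

    vf->reset_requests++;
    return true;
}
```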

Signed-off-by: Vignesh Chander 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher

drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter

2024-06-27T21:30:39+00:00

This gives clearer per-device logging.

Signed-off-by: Vignesh Chander 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher

drm/amdgpu: fix sriov host flr handler

2024-06-14T20:15:58+00:00

We send back the ready to reset message before we stop anything. This is
wrong. Move it to when we are actually ready for the FLR to happen.

In the current state since we take tens of seconds to stop everything,
it is very likely that host would give up waiting and reset the GPU
before we send ready, so it would be the same as before. But this gets
rid of the hack with reset_domain locking and also lets us tell how slow
ready to reset actually is from the host. The ready to reset speed can
be improved later.

Signed-off-by: Yunxiang Li 
Acked-by: Christian König 
Reviewed-by: Emily Deng 
Signed-off-by: Alex Deucher

drm/amdgpu: Add reset_context flag for host FLR

2024-05-02T19:40:50+00:00

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.
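
A small illustration of why the job pointer is ambiguous and how an
explicit flag avoids that. The struct, the RESET_FLAG_HOST_FLR bit and
both helper names are hypothetical simplifications for this sketch, not
the driver's actual reset_context definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESET_FLAG_HOST_FLR (1UL << 0)  /* hypothetical flag bit */

struct job;  /* opaque; only the pointer value matters in this sketch */

/* Hypothetical, simplified reset context. */
struct reset_context {
    struct job *job;        /* NULL for several unrelated reset sources */
    unsigned long flags;
};

/* Old heuristic: treats every job-less reset as a host FLR. */
bool is_host_flr_by_job(const struct reset_context *ctx)
{
    assert(ctx);
    return ctx->job == NULL;
}

/* Explicit check: only true when the host notification set the flag. */
bool is_host_flr(const struct reset_context *ctx)
{
    assert(ctx);
    return (ctx->flags & RESET_FLAG_HOST_FLR) != 0;
}
```

For a context with a NULL job and no flag set, such as one a KFD-style
reset worker might build, the first helper reports a host FLR and the
second does not.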

Signed-off-by: Yunxiang Li 
Reviewed-by: Emily Deng 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix two reset triggered in a row

2024-05-02T19:40:44+00:00

Sometimes a hung GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if it
schedules after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.
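
The effect of the reordering can be shown with a toy model. Everything
here is a hypothetical simplification: struct gpu, the reset_pending
flag and the function names only stand in for the real work items and
for amdgpu_device_stop_pending_resets.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one pending slot shared by all reset sources. */
struct gpu {
    int resets_done;
    bool reset_pending;   /* another source has a reset queued */
};

void schedule_reset(struct gpu *gpu)
{
    assert(gpu);
    gpu->reset_pending = true;
}

/*
 * racing_source: a second source schedules while this reset is running.
 * stop_pending_after: cancel pending resets after the reset (new order)
 * instead of before it (old order).
 */
void run_reset(struct gpu *gpu, bool racing_source, bool stop_pending_after)
{
    assert(gpu);
    if (!stop_pending_after)
        gpu->reset_pending = false;   /* old order: cancel up front */
    if (racing_source)
        schedule_reset(gpu);          /* arrives mid-reset */
    gpu->resets_done++;
    if (stop_pending_after)
        gpu->reset_pending = false;   /* new order: cancel once done */
}

/* Count how many resets a single hang with two sources ends up causing. */
int resets_for_one_hang(bool stop_pending_after)
{
    struct gpu gpu = {0};

    run_reset(&gpu, true, stop_pending_after);
    while (gpu.reset_pending)
        run_reset(&gpu, false, stop_pending_after);
    return gpu.resets_done;
}
```

In this model the old order performs two resets for one hang, while the
new order performs one.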

Remove the unnecessary and incorrect checks for amdgpu_in_reset that
were loosely serving this purpose.

Signed-off-by: Yunxiang Li 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher

drm/amdgpu: trigger flr_work if reading pf2vf data failed

2024-03-20T17:38:13+00:00

If reading pf2vf data fails 30 times in a row, something is wrong and
flr_work needs to be triggered to recover.

Also use dev_err to print the error message so that it identifies the
affected device, and add a warning message if waiting for
IDH_FLR_NOTIFICATION_CMPL times out.
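
The consecutive-failure rule could look roughly like the counter below.
The limit of 30 comes from the message above; the struct, the function
name and the exact comparison (triggering on the 30th failure) are
assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define PF2VF_READ_FAIL_LIMIT 30  /* from the commit message */

/* Hypothetical monitor state for this sketch. */
struct pf2vf_monitor {
    unsigned int fail_count;  /* consecutive failed reads */
    bool flr_work_queued;     /* stands in for scheduling flr_work */
};

/* Record the result of one pf2vf read attempt. */
void pf2vf_read_done(struct pf2vf_monitor *m, bool ok)
{
    assert(m);

    if (ok) {
        m->fail_count = 0;    /* only uninterrupted failures count */
        return;
    }

    if (++m->fail_count >= PF2VF_READ_FAIL_LIMIT) {
        m->flr_work_queued = true;
        m->fail_count = 0;    /* start counting afresh after recovery */
    }
}
```

Resetting the counter on every successful read is what makes the rule
apply to consecutive failures and not to a running total.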

Signed-off-by: Zhigang Luo 
Acked-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: Support passing poison consumption ras block to SRIOV

2024-01-25T19:58:03+00:00

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: add RAS poison consumption handler for AI SRIOV

2022-12-15T17:18:19+00:00

Send a message to the host, which will handle it.

v2: split the patch into two parts, one is for mxgpu ai and another one
is for common poison consumption handler.

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher