linux-toradex.git/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h, branch v6.16-rc5

drm/amdgpu: Implement unrecoverable error message handling for VFs

2025-05-07T21:43:13+00:00

This notification may arrive in the VF mailbox while the VF is polling for
a response to another event.

This patch covers the following scenarios:

- If VF is already in RMA state, then do not attempt to contact the host.
  Host will ignore the VF after sending the notification.

- If the notification is detected during polling, then set the RMA status,
  and return error to caller.

- If the notification arrives by interrupt, then set the RMA status and
  queue a reset.  This reset will fail and VF will stop runtime services.
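
The three scenarios above can be sketched as a small user-space model. This is
only an illustration under stated assumptions: the struct, enum, and function
names are hypothetical and are not taken from the driver, and the error codes
are chosen for the example only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the driver. */
enum mb_event { MB_NONE, MB_RESPONSE, MB_UNRECOVERABLE_ERR };

struct vf_state {
    bool rma;            /* the host reported an unrecoverable error */
    bool reset_queued;   /* a reset was scheduled from the interrupt path */
    int  host_requests;  /* messages actually sent to the host */
};

/* Scenario 1: once in RMA state, never contact the host again. */
int vf_send_request(struct vf_state *vf)
{
    if (vf->rma)
        return -ENODEV;
    vf->host_requests++;
    return 0;
}

/* Scenario 2: the notification shows up while polling for another response. */
int vf_poll_response(struct vf_state *vf, enum mb_event seen)
{
    if (seen == MB_UNRECOVERABLE_ERR) {
        vf->rma = true;
        return -EIO;               /* report the failure to the caller */
    }
    return seen == MB_RESPONSE ? 0 : -ETIMEDOUT;
}

/* Scenario 3: the notification is delivered by interrupt. */
void vf_mailbox_irq(struct vf_state *vf, enum mb_event seen)
{
    if (seen == MB_UNRECOVERABLE_ERR) {
        vf->rma = true;
        vf->reset_queued = true;   /* this reset is expected to fail */
    }
}
```

In this model, a request sent after the RMA flag is set returns an error
without incrementing host_requests, which mirrors the first scenario.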

Reviewed-by: Shravan Kumar Gande 
Signed-off-by: Victor Skvortsov 
Signed-off-by: Ellen Pan 
Signed-off-by: Alex Deucher

drm/amdgpu: Implement Runtime Bad Page query for VFs

2025-05-07T21:41:49+00:00

Host will send a notification when new bad pages are available.

Upon guest request, the first 256 bad page addresses
will be placed into the PF2VF region.
The guest should pause the PF2VF worker thread while
the copy is in progress.
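
A minimal sketch of the guest side of this query, assuming a simplified
region layout. The structures, field names, and the boolean pause flag are
hypothetical; only the 256-entry limit comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BAD_PAGES 256  /* from the commit text: first 256 addresses */

/* Hypothetical model of the shared region and of the guest state. */
struct pf2vf_region {
    uint32_t bad_page_count;
    uint64_t bad_pages[MAX_BAD_PAGES];
};

struct guest {
    bool     worker_paused;
    uint64_t bad_pages[MAX_BAD_PAGES];
    size_t   nr_bad_pages;
};

/*
 * Pause the worker, take the addresses the host placed in the region,
 * then resume the worker. The request to the host would be sent after
 * the pause; it is omitted here.
 */
size_t guest_fetch_bad_pages(struct guest *g, const struct pf2vf_region *r)
{
    size_t n = r->bad_page_count;

    if (n > MAX_BAD_PAGES)
        n = MAX_BAD_PAGES;         /* never read past the region's array */

    g->worker_paused = true;
    memcpy(g->bad_pages, r->bad_pages, n * sizeof(r->bad_pages[0]));
    g->nr_bad_pages = n;
    g->worker_paused = false;

    return n;
}
```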

Reviewed-by: Shravan Kumar Gande 
Signed-off-by: Victor Skvortsov 
Signed-off-by: Ellen Pan 
Signed-off-by: Alex Deucher

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37+00:00

For the RAS error scenario, the VF guest driver will check the mailbox
and set the fed flag to avoid unnecessary HW accesses.
Additionally, poll for the reset completion message first
to avoid accidentally spamming multiple reset requests to the host.

v2: add another mailbox check for handling case where kfd detects
timeout first

v3: set host_flr bit and use wait_for_reset
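
The ordering described above (check the mailbox, set the fed flag, then poll
for completion before asking for a reset) might look roughly like the
following model. All names other than the "fed" flag are hypothetical, and
the mailbox is reduced to an array of already-received messages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical message codes for this sketch only. */
enum mb_msg { MSG_NONE, MSG_RAS_FATAL, MSG_FLR_DONE };

struct vf {
    bool fed;             /* fatal error detected: skip HW accesses */
    int  reset_requests;  /* reset requests sent to the host */
};

/* Check one mailbox message; a fatal-error notification sets the flag. */
void vf_check_mailbox(struct vf *vf, enum mb_msg msg)
{
    if (msg == MSG_RAS_FATAL)
        vf->fed = true;
}

/*
 * Returns true if the reset was already completed by the host, in which
 * case no request is sent. Otherwise one request is sent and false is
 * returned so the caller can wait for it.
 */
bool vf_reset(struct vf *vf, const enum mb_msg *inbox, int n)
{
    if (vf->fed) {
        /* Poll for completion first instead of sending another request. */
        for (int i = 0; i < n; i++)
            if (inbox[i] == MSG_FLR_DONE)
                return true;
    }
    vf->reset_requests++;
    return false;
}
```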

Signed-off-by: Vignesh Chander 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix complex macros error

2023-10-05T21:59:35+00:00

Fixes the below:

ERROR: Macros with complex values should be enclosed in parentheses

WARNING: macros should not use a trailing semicolon
+#define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter));
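
To illustrate why checkpatch warns about the trailing semicolon, here is a
user-space sketch that uses C11 atomics in place of the kernel's atomic_inc().
The struct and macro names are hypothetical; this is not the patch itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical device struct for this sketch. */
struct dev {
    atomic_int vram_lost_counter;
};

/* No trailing semicolon: the caller supplies it, as with a function call. */
#define inc_vram_lost(d) atomic_fetch_add(&((d)->vram_lost_counter), 1)

int note_vram_lost(struct dev *d, int lost)
{
    /*
     * If the macro ended in ';', the expansion below would become
     * "inc_vram_lost(d);;" and the second ';' would terminate the if
     * statement, leaving the else without a matching if.
     */
    if (lost)
        inc_vram_lost(d);
    else
        return 0;

    return atomic_load(&d->vram_lost_counter);
}
```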

Cc: Christian König 
Cc: Alex Deucher 
Cc: "Pan, Xinhui" 
Signed-off-by: Srinivasan Shanmugam 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: add RAS poison consumption handler for AI SRIOV

2022-12-15T17:18:19+00:00

Send a message to the host, which will handle it.

v2: split the patch into two parts, one is for mxgpu ai and another one
is for common poison consumption handler.

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: add dummy event6 for vega10

2022-01-07T22:19:34+00:00

[why]
A malicious mailbox event1 makes driver loading fail on vega10.
A dummy event6 prevents the driver from taking the response to a malicious
event1 as its own.

[how]
On vega10, send a mailbox event6 before sending event1.
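
The ordering can be sketched with a small model that records the events
sent. The log structure and function names are hypothetical; only the event
numbers and the vega10 condition come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Event numbers as named in the commit text; their meaning is not modeled. */
enum { EVENT1 = 1, EVENT6 = 6 };

#define LOG_SIZE 4

/* Hypothetical record of what was written to the mailbox. */
struct mailbox_log {
    int sent[LOG_SIZE];
    int count;
};

static void mb_send(struct mailbox_log *log, int event)
{
    if (log->count < LOG_SIZE)
        log->sent[log->count++] = event;
}

/* On vega10, a dummy event6 goes out before the real event1. */
void vf_send_event1(struct mailbox_log *log, bool is_vega10)
{
    if (is_vega10)
        mb_send(log, EVENT6);
    mb_send(log, EVENT1);
}
```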

Signed-off-by: James Yao 
Reviewed-by: Jingwen Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s

2021-12-13T21:32:34+00:00

An ASIC with a big FB needs more time to clear the FB during reset.
This change extends the SRIOV VF reset completion wait timeout from 5s
to 10s.

Signed-off-by: Zhigang Luo 
Acked-by: Shaoyun Liu 
Signed-off-by: Alex Deucher

drm/amd/amdgpu: Add ready_to_reset resp for vega10

2021-08-30T18:59:33+00:00

Send a response to the host after receiving the FLR notification from it.
Port the NV change to vega10.

Signed-off-by: YuBiao Wang 
Reviewed-by: Jingwen Chen 
Signed-off-by: Alex Deucher

drm/amdgpu/SRIOV: Extend VF reset request wait period

2020-12-15T16:35:35+00:00

In the virtualization case, when one VF sends too many
FLR requests, the hypervisor stops responding to this
VF's requests for a long period of time. This is called
event guard. During this cooling period, the guest
driver should wait instead of doing other things. After
this period, the guest driver resumes the reset
process and returns to normal.

Currently, guest driver would wait 12 seconds and return fail
if it doesn't get response from host.

Solution: extend this waiting time in guest driver and poll
response periodically. Poll happens every 6 seconds and it will
last for 60 seconds.

v2: change the max repetition times from number to macro.
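
The polling scheme (every 6 seconds, for up to 60 seconds) amounts to a
bounded retry loop. In this sketch the macro names and the callback type are
hypothetical; only the interval and the total duration come from the text
above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define POLL_INTERVAL_S   6   /* from the commit text */
#define POLL_MAX_REPEATS 10   /* 60 s / 6 s; hypothetical macro name */

/*
 * Hypothetical callback: waits up to interval_s seconds and reports
 * whether the host has responded.
 */
typedef bool (*poll_fn)(void *ctx, int interval_s);

int wait_for_host(poll_fn poll_once, void *ctx)
{
    for (int i = 0; i < POLL_MAX_REPEATS; i++)
        if (poll_once(ctx, POLL_INTERVAL_S))
            return 0;
    return -ETIMEDOUT;
}

/* Example callback: the host "responds" once *ctx polls have elapsed. */
bool respond_after(void *ctx, int interval_s)
{
    int *remaining = ctx;

    (void)interval_s;          /* no real sleeping in this model */
    return --(*remaining) <= 0;
}
```

Keeping the repetition count in a macro lets the total wait be changed in one
place without touching the loop.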

Signed-off-by: Jiange Zhao 
Acked-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: extent threshold of waiting FLR_COMPLETE

2020-04-24T15:42:11+00:00

Extend the threshold to 5s to accommodate a WHOLE GPU reset, which needs
3+ seconds to finish.

Signed-off-by: Monk Liu 
Acked-by: Yintian Tao 
Signed-off-by: Alex Deucher