linux-toradex.git/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h, branch v6.0-rc4

drm/amdgpu: add dummy event6 for vega10

2022-01-07T22:19:34+00:00

[why]
Malicious mailbox event1 fails driver loading on vega10.
A dummy event6 prevent driver from taking response from malicious event1 as its own.

[how]
On vega10, send a mailbox event6 before sending event1.

Signed-off-by: James Yao 
Reviewed-by: Jingwen Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s

2021-12-13T21:32:34+00:00

For the ASIC has big FB, it need more time to clear FB during reset.
This change extended SRIOV VF waiting reset completion timeout from 5s
to 10s.

Signed-off-by: Zhigang Luo 
Acked-by: Shaoyun Liu 
Signed-off-by: Alex Deucher

drm/amd/amdgpu: Add ready_to_reset resp for vega10

2021-08-30T18:59:33+00:00

Send response to host after received the flr notification from host.
Port NV change to vega10.

Signed-off-by: YuBiao Wang 
Reviewed-by: Jingwen Chen 
Signed-off-by: Alex Deucher

drm/amdgpu/SRIOV: Extend VF reset request wait period

2020-12-15T16:35:35+00:00

In Virtualization case, when one VF is sending too many
FLR requests, hypervisor would stop responding to this
VF's request for a long period of time. This is called
event guard. During this period of cooling time, guest
driver should wait instead of doing other things. After
this period of time, guest driver would resume reset
process and return to normal.

Currently, guest driver would wait 12 seconds and return fail
if it doesn't get response from host.

Solution: extend this waiting time in guest driver and poll
response periodically. Poll happens every 6 seconds and it will
last for 60 seconds.

v2: change the max repetition times from number to macro.

Signed-off-by: Jiange Zhao 
Acked-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: extent threshold of waiting FLR_COMPLETE

2020-04-24T15:42:11+00:00

to 5s to satisfy WHOLE GPU reset which need 3+ seconds to
finish

Signed-off-by: Monk Liu 
Acked-by: Yintian Tao 
Signed-off-by: Alex Deucher

drm/amdgpu: cleanup idh event/req for NV headers

2020-04-01T18:44:43+00:00

1) drop the headers from AI in mxgpu_nv.c, should refer to mxgpu_nv.h

2) the IDH_EVENT_MAX is not used and not aligned with host side
   so drop it
3) the IDH_TEXT_MESSAG was provided in host but not defined in guest

Signed-off-by: Monk Liu 
Reviewed-by: Emily Deng 
Signed-off-by: Alex Deucher

drm/amd/powerplay: enable pp one vf mode for vega10

2019-12-11T20:22:07+00:00

Originally, due to the restriction from PSP and SMU, VF has
to send message to hypervisor driver to handle powerplay
change which is complicated and redundant. Currently, SMU
and PSP can support VF to directly handle powerplay
change by itself. Therefore, the old code about the handshake
between VF and PF to handle powerplay will be removed and VF
will use new the registers below to handshake with SMU.
mmMP1_SMN_C2PMSG_101: register to handle SMU message
mmMP1_SMN_C2PMSG_102: register to handle SMU parameter
mmMP1_SMN_C2PMSG_103: register to handle SMU response

v2: remove module parameter pp_one_vf
v3: fix the parens
v4: forbid vf to change smu feature
v5: use hwmon_attributes_visible to skip sepicified hwmon atrribute
v6: change skip condition at vega10_copy_table_to_smc

Signed-off-by: Yintian Tao 
Acked-by: Evan Quan 
Reviewed-by: Kenneth Feng 
Signed-off-by: Alex Deucher

drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV

2019-05-06T14:36:48+00:00

SR-IOV host side will send IDH_QUERY_ALIVE to guest VM to check
if this guest VM is still alive (not destroyed). The only thing
guest KMD need to do is to send ACK back to host.

Signed-off-by: Trigger Huang 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: support dpm level modification under virtualization v3

2019-04-10T18:53:27+00:00

Under vega10 virtualuzation, smu ip block will not be added.
Therefore, we need add pp clk query and force dpm level function
at amdgpu_virt_ops to support the feature.

v2: add get_pp_clk existence check and use kzalloc to allocate buf

v3: return -ENOMEM for allocation failure and correct the coding style

Signed-off-by: Yintian Tao 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: refactoring mailbox to fix TDR handshake bugs(v2)

2018-03-14T19:38:27+00:00

this patch actually refactor mailbox implmentations, and
all below changes are needed together to fix all those mailbox
handshake issues exposured by heavey TDR test.

1)refactor all mailbox functions based on byte accessing for mb_control
reason is to avoid touching non-related bits when writing trn/rcv part of
mailbox_control, this way some incorrect INTR sent to hypervisor
side could be avoided, and it fixes couple handshake bug.

2)trans_msg function re-impled: put a invalid
logic before transmitting message to make sure the ACK bit is in
a clear status, otherwise there is chance that ACK asserted already
before transmitting message and lead to fake ACK polling.
(hypervisor side have some tricks to workaround ACK bit being corrupted
by VF FLR which hase an side effects that may make guest side ACK bit
asserted wrongly), and clear TRANS_MSG words after message transferred.

3)for mailbox_flr_work, it is also re-worked: it takes the mutex lock
first if invoked, to block gpu recover's participate too early while
hypervisor side is doing VF FLR. (hypervisor sends FLR_NOTIFY to guest
before doing VF FLR and sentds FLR_COMPLETE after VF FLR done, and
the FLR_NOTIFY will trigger interrupt to guest which lead to
mailbox_flr_work being invoked)

This can avoid the issue that mailbox trans msg being cleared by its VF FLR.

4)for mailbox_rcv_irq IRQ routine, it should only peek msg and schedule
mailbox_flr_work, instead of ACK to hypervisor itself, because FLR_NOTIFY
msg sent from hypervisor side doesn't need VF's ACK (this is because
VF's ACK would lead to hypervisor clear its trans_valid/msg, and this
would cause handshake bug if trans_valid/msg is cleared not due to
correct VF ACK but from a wrong VF ACK like this "FLR_NOTIFY" one)

This fixed handshake bug that sometimes GUEST always couldn't receive
"READY_TO_ACCESS_GPU" msg from hypervisor.

5)seperate polling time limite accordingly:
POLL ACK cost no more than 500ms
POLL MSG cost no more than 12000ms
POLL FLR finish cost no more than 500ms

6) we still need to set adev into in_gpu_reset mode after we received
FLR_NOTIFY from host side, this can prevent innocent app wrongly succesed
to open amdgpu dri device.

FLR_NOFITY is received due to an IDLE hang detected from hypervisor side
which indicating GPU is already die in this VF.

v2:
use MACRO as the offset of mailbox_control register
don't test if NOTIFY_CMPL event in rcv_msg since it won't
recieve that message anymore

Signed-off-by: Monk Liu 
Reviewed-by: Pixel Ding 
Signed-off-by: Alex Deucher