linux-toradex.git/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c, branch v5.12

gpu/drm: ring_mirror_list --> pending_list

2020-12-08T13:38:03+00:00

Rename "ring_mirror_list" to "pending_list",
to describe what something is, not what it does,
how it's used, or how the hardware implements it.

This also abstracts the actual hardware
implementation: how the low-level driver
communicates with the device it drives (ring, CAM,
etc.) shouldn't be exposed to DRM.

The pending_list holds the submitted jobs, which
are out of our control. Usually this means they are
pending execution in hardware, but the
latter definition is the more general (inclusive)
one.
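
To illustrate the idea, here is a minimal user-space sketch of a scheduler that tracks handed-off jobs on a pending_list through a "list" member embedded in each job. The list helpers are a simplified stand-in for the kernel's struct list_head API, and the types toy_job and toy_sched and the functions toy_submit, toy_complete and toy_pending_count are hypothetical names used only for this example, not the DRM scheduler's actual code.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *prev, *next;
};

static void list_init(struct list_head *h)
{
	h->prev = h->next = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Hypothetical, heavily trimmed job and scheduler types. */
struct toy_job {
	struct list_head list;          /* link into toy_sched.pending_list */
	int id;
};

struct toy_sched {
	struct list_head pending_list;  /* jobs handed off, not yet finished */
};

/* Hand a job to the device; from here on it is out of our control. */
static void toy_submit(struct toy_sched *s, struct toy_job *j)
{
	list_add_tail(&j->list, &s->pending_list);
}

/* The device reported completion; the job is no longer pending. */
static void toy_complete(struct toy_job *j)
{
	list_del(&j->list);
}

/* Count the jobs that are still pending. */
static size_t toy_pending_count(const struct toy_sched *s)
{
	const struct list_head *p;
	size_t n = 0;

	for (p = s->pending_list.next; p != &s->pending_list; p = p->next)
		n++;
	return n;
}
```

Nothing in the sketch says how the device receives the job (ring, CAM, or anything else), which is the point of the rename.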

Signed-off-by: Luben Tuikov 
Acked-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/405573/

Cc: Alexander Deucher 
Cc: Andrey Grodzovsky 
Cc: Christian König 
Cc: Daniel Vetter 
Signed-off-by: Christian König

drm/scheduler: "node" --> "list"

2020-12-08T13:37:55+00:00

Rename "node" to "list" in struct drm_sched_job,
in order to make it consistent with what we see
being used throughout gpu_scheduler.h, for
instance in struct drm_sched_entity, as well as
the rest of DRM and the kernel.

Signed-off-by: Luben Tuikov 
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/403515/

Cc: Alexander Deucher 
Cc: Andrey Grodzovsky 
Cc: Christian König 
Cc: Daniel Vetter 
Signed-off-by: Christian König

drm/scheduler: Scheduler priority fixes (v2)

2020-08-18T22:20:17+00:00

Remove DRM_SCHED_PRIORITY_LOW, as it was used
in only one place.

Rename and separate by a line
DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT
as it represents a (total) count of said
priorities and it is used as such in loops
throughout the code. (With 0-based indexing, the
enumerator after the last priority equals the count.)

Remove redundant word HIGH in priority names,
and rename *KERNEL* to *HIGH*, as it really
means that, high.

v2: Add back KERNEL and remove SW and HW,
    in lieu of a single HIGH between NORMAL and KERNEL.
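
As a rough sketch of why a COUNT enumerator reads better than MAX in loops, the example below declares a priority enum with the count set apart by a blank line and iterates over per-priority run queues. The names toy_priority, toy_rq, toy_scheduler and toy_pick_priority, and the name of the lowest level, are assumptions made for illustration and are not taken from gpu_scheduler.h.

```c
#include <stddef.h>

/* Hypothetical priority levels; COUNT is not a level, it is the
 * number of levels, so it is separated from them by a blank line. */
enum toy_priority {
	TOY_PRIORITY_MIN,
	TOY_PRIORITY_NORMAL,
	TOY_PRIORITY_HIGH,
	TOY_PRIORITY_KERNEL,

	TOY_PRIORITY_COUNT
};

struct toy_rq {
	unsigned int num_jobs;
};

struct toy_scheduler {
	/* One run queue per priority; COUNT sizes the array directly. */
	struct toy_rq rq[TOY_PRIORITY_COUNT];
};

/* Return the highest priority that has queued work, or -1 if idle. */
static int toy_pick_priority(const struct toy_scheduler *s)
{
	int i;

	for (i = TOY_PRIORITY_COUNT - 1; i >= TOY_PRIORITY_MIN; i--)
		if (s->rq[i].num_jobs)
			return i;
	return -1;
}
```

Because the enumerators start at 0, TOY_PRIORITY_COUNT is both the array size and the loop bound, with no "+ 1" adjustments.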

Signed-off-by: Luben Tuikov 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: revert "fix system hang issue during GPU reset"

2020-08-14T20:22:40+00:00

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems as this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929.

Signed-off-by: Christian König 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: fix system hang issue during GPU reset

2020-07-27T20:21:37+00:00

When the GPU hangs, the driver has multiple paths into amdgpu_device_gpu_recover;
the atomics adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe for other threads to access the GPU,
as this may cause the GPU reset to fail. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protects the GPU from being accessed by
external threads during recovery.
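
The re-entry guard can be sketched in user space with C11 atomics: only the caller that flips the flag from 0 to 1 performs the recovery, and every other path backs off. This is a minimal illustration under assumptions, not the driver's code; toy_device, toy_reset_try_begin, toy_reset_end and toy_gpu_recover are hypothetical names, and the rw_semaphore that keeps other threads off the hardware is not modeled here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device state. */
struct toy_device {
	atomic_int in_gpu_reset;  /* 0 = idle, 1 = recovery in progress */
	int resets_done;          /* stand-in for the actual reset work */
};

/* Try to become the single thread that performs the recovery. */
static bool toy_reset_try_begin(struct toy_device *d)
{
	int expected = 0;

	/* Succeeds only for the caller that changes 0 -> 1. */
	return atomic_compare_exchange_strong(&d->in_gpu_reset, &expected, 1);
}

static void toy_reset_end(struct toy_device *d)
{
	atomic_store(&d->in_gpu_reset, 0);
}

/* Entry point that may be reached from several paths at once. */
static bool toy_gpu_recover(struct toy_device *d)
{
	if (!toy_reset_try_begin(d))
		return false;     /* another path is already recovering */

	d->resets_done++;
	toy_reset_end(d);
	return true;
}
```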

v2:
1. add rwlock for some ioctls, debugfs and the file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in the kfd
driver.
3. remove try_lock and make adev->in_gpu_reset atomic, to avoid
re-entering GPU recovery for the same GPU hang.

v3:
1. change back to using adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all code paths, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-entering GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove commented-out code in amdgpu_device.c
3. add a more detailed comment in the commit message
4. define a wrapper function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang 
Suggested-by: Andrey Grodzovsky 
Suggested-by: Christian König 
Suggested-by: Felix Kuehling 
Suggested-by: Lijo Lazar 
Suggested-by: Luben Tuikov 
Signed-off-by: Dennis Li 
Signed-off-by: Alex Deucher

drm/amdgpu: don't do soft recovery if gpu_recovery=0

2020-07-10T21:40:39+00:00

It's impossible to debug shader hangs with soft recovery.

Signed-off-by: Marek Olšák 
Reviewed-by: Alex Deucher 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: remove distinction between explicit and implicit sync (v2)

2020-07-01T05:59:22+00:00

According to Marek a pipeline sync should be inserted for implicit syncs as well.

v2: bump the driver version

Signed-off-by: Christian König 
Tested-by: Marek Olšák 
Signed-off-by: Marek Olšák 
Signed-off-by: Alex Deucher

drm/amdgpu: remove set but not used variable 'priority'

2020-04-23T19:06:41+00:00

drivers/gpu/drm/amd/amdgpu/amdgpu_job.c: In function amdgpu_job_submit:
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:148:26: warning: variable priority set but not used [-Wunused-but-set-variable]

commit 33abcb1f5a17 ("drm/amdgpu: set compute queue priority at mqd_init")
left this behind; remove it.

Reviewed-by: Christian König 
Signed-off-by: YueHaibing 
Signed-off-by: Alex Deucher

drm/amdgpu: restrict debugfs register access under SR-IOV

2020-04-13T16:01:04+00:00

On bare metal, nothing else is needed to handle
GPU register access through MMIO.
Under virtualization, GPU register access is
implemented through KIQ at run time because of
world switches.

Therefore, under SR-IOV the user can only use
debugfs to read/write GPU registers when all
three conditions below are met.
- amdgpu_gpu_recovery=0
- TDR happened
- in_gpu_reset=0
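
The three conditions above can be expressed as a single predicate. The sketch below is a hedged illustration only: toy_vf_state and toy_can_access_debugfs are hypothetical names, the fields merely mirror the conditions listed, and the real amdgpu_virt_enable_access_debugfs() may check and return things differently.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the three conditions refer to. */
struct toy_vf_state {
	int  gpu_recovery;   /* module parameter; 0 means recovery disabled */
	bool tdr_happened;   /* a timeout detection and recovery event occurred */
	bool in_gpu_reset;   /* a GPU reset is currently in progress */
};

/* Under SR-IOV, allow debugfs register access only if all three hold. */
static bool toy_can_access_debugfs(const struct toy_vf_state *s)
{
	return s->gpu_recovery == 0 && s->tdr_happened && !s->in_gpu_reset;
}
```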

v2: merge amdgpu_virt_can_access_debugfs() into
    amdgpu_virt_enable_access_debugfs()

v3: drop ret variable in amdgpu_virt_enable_access_debugfs()
    and directly return result

Signed-off-by: Yintian Tao 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: implement more ib pools (v2)

2020-04-01T18:44:44+00:00

There are three IB pools: normal, VM, and direct.

Any job that schedules IBs without depending on the GPU scheduler should
use the DIRECT pool.

Any job that schedules direct VM update IBs should use the VM pool.

Any other job uses the NORMAL pool.
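
The selection rule can be sketched as a small helper. This is an illustration under assumptions rather than the driver's code: toy_ib_pool, toy_job_desc and toy_pick_ib_pool are hypothetical names, and checking the VM case before the DIRECT case is an assumed precedence that the commit message does not spell out.

```c
#include <stdbool.h>

/* Hypothetical pool identifiers mirroring the three pools named above. */
enum toy_ib_pool {
	TOY_IB_POOL_NORMAL,
	TOY_IB_POOL_VM,
	TOY_IB_POOL_DIRECT,

	TOY_IB_POOL_COUNT
};

/* Hypothetical description of how a job submits its IBs. */
struct toy_job_desc {
	bool direct_vm_update;    /* schedules direct VM update IBs */
	bool bypasses_scheduler;  /* schedules IBs without the GPU scheduler */
};

/* Pick the pool for a job; VM is checked first (assumed precedence). */
static enum toy_ib_pool toy_pick_ib_pool(const struct toy_job_desc *j)
{
	if (j->direct_vm_update)
		return TOY_IB_POOL_VM;
	if (j->bypasses_scheduler)
		return TOY_IB_POOL_DIRECT;
	return TOY_IB_POOL_NORMAL;
}
```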

v2: squash in coding style fix

Signed-off-by: xinhui pan 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher