linux-toradex.git/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v5.5-rc7

drm/amdgpu: update parameter of ras_ih_cb

2019-10-03T14:11:01+00:00

Change struct ras_err_data *err_data to void *err_data to align with
the umc code, so that the callback declaration in each ras block does
not need to know the structure type.

Signed-off-by: Tao Zhou 
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher
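
As a rough illustration of the change above, the following userspace
sketch shows a callback type that takes an opaque void *err_data. Only
the names ras_ih_cb and struct ras_err_data come from the commit
message; the struct fields, struct fake_device, the simplified
two-argument signature, and both helper functions are assumptions made
for the example and are not the driver's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in the amdgpu driver. */
struct ras_err_data {
	unsigned long ue_count;	/* uncorrectable errors (assumed field) */
	unsigned long ce_count;	/* correctable errors (assumed field) */
};

struct fake_device {
	int id;
};

/*
 * The callback type takes an opaque pointer, so code that only declares
 * or stores the callback does not need to know struct ras_err_data.
 * The real callback may take further arguments; this is simplified.
 */
typedef int (*ras_ih_cb)(struct fake_device *dev, void *err_data);

/* One block's callback: it converts the opaque pointer back itself. */
static int umc_process_error(struct fake_device *dev, void *err_data)
{
	struct ras_err_data *data = err_data;

	assert(dev != NULL && data != NULL);
	data->ue_count += 1;	/* pretend one uncorrectable error was found */
	return 0;
}

/* A dispatcher that only knows about the opaque callback type. */
static int ras_dispatch(ras_ih_cb cb, struct fake_device *dev, void *err_data)
{
	return cb ? cb(dev, err_data) : -1;
}
```

The trade-off is that the compiler no longer checks the pointer type at
the call site, so each callback is responsible for converting err_data
to the structure it expects.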

drm/amdgpu: fix ras ctrl debugfs node leak

2019-09-16T20:30:38+00:00

Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the node one by one.

Signed-off-by: Guchun Chen 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix mutex lock from atomic context.

2019-09-16T15:09:59+00:00

Problem:
amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
because writing to EEPROM during ASIC reset was unstable.
But for ERREVENT_ATHUB_INTERRUPT, amdgpu_ras_reset_gpu is called
directly from ISR context, so locking is not allowed. It is also
irrelevant for this particular interrupt, as this is a generic RAS
interrupt and not specific to memory errors.

Fix:
Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Tao Zhou 
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher
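
The fix described above can be pictured with a small userspace
simulation. In the kernel the check would presumably use in_task();
here a plain flag stands in for it. Every name ending in _sim, and the
two counters, are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's in_task(); here it is just a settable flag. */
static bool sim_in_task = true;

static bool in_task_sim(void)
{
	return sim_in_task;
}

static int reserve_calls;	/* how often the sleeping path ran */
static int reset_requests;	/* how often a reset was requested */

/*
 * Stand-in for amdgpu_ras_reserve_bad_pages(): assumed to take a mutex,
 * so it may sleep and must never run in interrupt context.
 */
static void reserve_bad_pages_sim(void)
{
	assert(sim_in_task);	/* sleeping is only legal in task context */
	reserve_calls++;
}

/* Stand-in for amdgpu_ras_reset_gpu(): callable from either context. */
static void ras_reset_gpu_sim(void)
{
	/* Skip the sleeping call when not in task context. */
	if (in_task_sim())
		reserve_bad_pages_sim();

	reset_requests++;	/* the reset request itself is always made */
}
```

With sim_in_task set to false, ras_reset_gpu_sim() still requests the
reset but never reaches the path that could sleep.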

drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place

2019-09-13T22:50:47+00:00

ras recovery_init should be called after ttm init,
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini

v2: add more comment and print.
    remove cancel_work_sync in recovery_init.

Signed-off-by: Tao Zhou 
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: save umc error records

2019-09-13T22:50:40+00:00

save umc error records to ras bad page array

v2: add bad pages before gpu reset
v3: add NULL check for adev->umc.funcs

Signed-off-by: Tao Zhou 
Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: change ras bps type to eeprom table record structure

2019-09-13T22:50:26+00:00

change bps type from retired page to eeprom table record, prepare for
saving umc error records to eeprom

Signed-off-by: Tao Zhou 
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: Add system auto reboot to RAS.

2019-09-13T22:41:17+00:00

In case of a RAS error, allow the user to configure automatic system
reboot through ras_ctrl.
This is also part of the temporary workaround for the RAS
hang problem.

v4: Use latest kernel API for disk sync.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: Avoid HW GPU reset for RAS.

2019-09-13T22:41:05+00:00

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until proper solution in PSP/SMU is ready:
When an uncorrectable error happens, the DF unconditionally broadcasts
error event packets to all its clients/slaves upon receiving the fatal
error event and freezes all its outbound queues; the err_event_athub
interrupt is then triggered.
In that case we use this interrupt to issue a GPU reset. The GPU reset
code is modified for this case to avoid a HW reset: it only stops the
schedulers, detaches the fences of all in-progress and not-yet-scheduled
jobs, sets an error code on them, and signals them.
Also reject any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.

Signed-off-by: Andrey Grodzovsky 
Acked-by: Felix Kuehling 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher
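
The software-only "reset" described in this commit can be sketched as a
userspace model: stop the scheduler, mark every job that has not
finished with an error and signal it, and refuse new submissions
afterwards. All types, names, and the SIM_ERR_POISON value below are
hypothetical; the driver's own scheduler and fence structures are
different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_ERR_POISON (-1)	/* hypothetical error code for cancelled work */

struct sim_job {
	bool signaled;	/* has the job's fence been signaled? */
	int error;	/* 0 on success, negative on failure */
};

struct sim_sched {
	bool stopped;
	struct sim_job *jobs;
	size_t njobs;
};

static bool ras_fatal_seen;	/* set once the fatal interrupt has fired */

/* Stop the scheduler and fail every job that has not completed yet. */
static void sim_stop_all_jobs(struct sim_sched *s)
{
	assert(s != NULL);
	s->stopped = true;
	for (size_t i = 0; i < s->njobs; i++) {
		if (s->jobs[i].signaled)
			continue;	/* already completed, leave untouched */
		s->jobs[i].error = SIM_ERR_POISON;
		s->jobs[i].signaled = true;	/* wake up any waiter */
	}
}

/* Model of the fatal-error path: no hardware reset is attempted. */
static void sim_handle_fatal(struct sim_sched *s)
{
	ras_fatal_seen = true;
	sim_stop_all_jobs(s);
}

/* New submissions are rejected once the fatal error has been seen. */
static int sim_submit(const struct sim_sched *s)
{
	if (ras_fatal_seen || s->stopped)
		return SIM_ERR_POISON;
	return 0;
}
```

Signaling the failed jobs rather than dropping them is what lets a
waiting application observe the error instead of blocking forever.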

drm/amdgpu: add helper function to do common ras_late_init/fini (v3)

2019-09-13T22:11:04+00:00

In late_init for ras, the helper function will be used to
1). disable ras feature if the IP block is masked as disabled
2). send enable feature command if the ip block was masked as enabled
3). create debugfs/sysfs node per IP block
4). register interrupt handler

v2: check ih_info.cb to decide add interrupt handler or not

v3: add ras_late_fini for cleanup all the ras fs node and remove
interrupt handler

Signed-off-by: Hawking Zhang 
Reviewed-by: Alex Deucher 
Reviewed-by: Tao Zhou 
Signed-off-by: Alex Deucher
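
The four steps and the v2/v3 notes above can be modelled in a few lines
of userspace C. This is only a sketch: struct sim_ras_block, its
fields, and the sim_ functions are hypothetical, and the early return
on the disabled path is an assumption about how the helper behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IP-block state for the sketch. */
struct sim_ras_block {
	bool enabled_in_mask;	/* is the block enabled in the RAS mask? */
	bool feature_on;	/* result of the enable/disable command */
	bool fs_nodes;		/* debugfs/sysfs nodes exist */
	bool irq_handler;	/* interrupt handler registered */
	int (*cb)(void *err_data);	/* optional interrupt callback */
};

/* A do-nothing callback, used to show the "callback present" case. */
static int sim_noop_cb(void *err_data)
{
	(void)err_data;
	return 0;
}

static int sim_ras_late_init(struct sim_ras_block *b)
{
	assert(b != NULL);

	/* 1) block masked as disabled: turn the feature off and stop
	 *    (stopping here is an assumption of this sketch) */
	if (!b->enabled_in_mask) {
		b->feature_on = false;
		return 0;
	}

	b->feature_on = true;	/* 2) send the enable command */
	b->fs_nodes = true;	/* 3) create debugfs/sysfs nodes */

	/* 4) register the handler only if a callback was given (v2) */
	if (b->cb)
		b->irq_handler = true;

	return 0;
}

/* v3: undo what late_init set up. */
static void sim_ras_late_fini(struct sim_ras_block *b)
{
	assert(b != NULL);
	b->fs_nodes = false;
	b->irq_handler = false;
}
```

Keeping the sequence in one helper means each IP block only supplies
its own data and callback rather than repeating the same four steps.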

drm/amdgpu: Add RAS EEPROM table.

2019-08-27T13:17:14+00:00

Add RAS EEPROM table manager to enable RAS errors to be stored
upon appearance and retrieved on driver load.

v2: Fix some prints.

v3:
Fix checksum calculation.
Make table record and header structs packed to do correct byte value sum.
Fix record crossing EEPROM page boundary.

v4:
Fix byte sum val calculation for record - look at sizeof(record).
Fix some style comments.

v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Luben Tuikov 
Signed-off-by: Alex Deucher
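
The v3/v4 notes mention a byte-sum checksum over packed structures and
records that cross an EEPROM page boundary. The sketch below shows one
common way to do both, assuming a checksum byte chosen so that the
total byte sum is zero modulo 256. The record layout, the function
names, and the page size used in the usage note are hypothetical and
not taken from the driver; __attribute__((packed)) assumes GCC or
Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record layout. It is packed so that no compiler-inserted
 * padding bytes end up in the byte sum over sizeof(record).
 */
struct sim_record {
	uint8_t err_type;
	uint64_t timestamp;
	uint64_t offset;
	uint8_t channel;
} __attribute__((packed));

/* Sum every byte of a buffer. */
static uint32_t byte_sum(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t sum = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++)
		sum += p[i];
	return sum;
}

/* Checksum byte chosen so that (sum + checksum) % 256 == 0. */
static uint8_t checksum_for(uint32_t sum)
{
	return (uint8_t)(256 - (sum % 256));
}

/*
 * Number of bytes of a write starting at addr that fit before the next
 * page boundary; the remainder has to go out in a separate write.
 */
static size_t first_chunk_len(uint32_t addr, size_t len, size_t page_size)
{
	size_t room;

	assert(page_size > 0);
	room = page_size - (addr % page_size);
	return len < room ? len : room;
}
```

For example, with an assumed 256-byte page, an 18-byte record written
at address 250 would be split into a 6-byte chunk followed by a 12-byte
chunk on the next page.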