summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2025-10-20drm/amdgpu: Remove unused members in amdgpu_mmanLijo Lazar
Discovery related members are now part of amdgpu_discovery_info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: query block error count of ras moduleYiPeng Chai
Query block error count of ras module. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add logic for VF data exchange region to init from dynamic ↵Ellen Pan
crit_region offsets 1. Added VF logic to init data exchange region using the offsets from dynamic(v2) critical regions; Signed-off-by: Ellen Pan <yunru.pan@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add logic for VF ipd and VF bios to init from dynamic ↵Ellen Pan
crit_region offsets 1. Added VF logic in amdgpu_virt to init IP discovery using the offsets from dynamic(v2) critical regions; 2. Added VF logic in amdgpu_virt to init bios image using the offsets from dynamic(v2) critical regions; Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Reuse fw_vram_usage_* for dynamic critical region in SRIOVEllen Pan
- During guest driver init, asa VFs receive PF msg to init dynamic critical region(v2), VFs reuse fw_vram_usage_* from ttm to store critical region tables in a 5MB chunk. Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Introduce SRIOV critical regions v2 during VF initEllen Pan
1. Introduced amdgpu_virt_init_critical_region during VF init. - VFs use init_data_header_offset and init_data_header_size_kb transmitted via PF2VF mailbox to fetch the offset of critical regions' offsets/sizes in VRAM and save to adev->virt.crit_region_offsets and adev->virt.crit_region_sizes_kb. Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add SRIOV crit_region_version supportEllen Pan
1. Added enum amd_sriov_crit_region_version to support multi versions 2. Added logic in SRIOV mailbox to regonize crit_region version during req_gpu_init_data Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Updated naming of SRIOV critical region offsets/sizes with _V1 ↵Ellen Pan
suffix - This change prepares the later patches to intro _v2 suffix to SRIOV critical regions Signed-off-by: Ellen Pan <yunru.pan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: query bad page info of ras moduleYiPeng Chai
Query bad page info of ras module. V2: Update code to reuse bad page output code. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: ras module supports error injectionYiPeng Chai
ras module supports error injection. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amd: Fix set but not used warningsTiezhu Yang
There are many set but not used warnings under drivers/gpu/drm/amd when compiling with the latest upstream mainline GCC: drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c:305:18: warning: variable ‘p’ set but not used [-Wunused-but-set-variable=] drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h:103:26: warning: variable ‘internal_reg_offset’ set but not used [-Wunused-but-set-variable=] ... drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h:164:26: warning: variable ‘internal_reg_offset’ set but not used [-Wunused-but-set-variable=] ... drivers/gpu/drm/amd/amdgpu/../display/dc/dc_dmub_srv.c:445:13: warning: variable ‘pipe_idx’ set but not used [-Wunused-but-set-variable=] drivers/gpu/drm/amd/amdgpu/../display/dc/dc_dmub_srv.c:875:21: warning: variable ‘pipe_idx’ set but not used [-Wunused-but-set-variable=] Remove the variables actually not used or add __maybe_unused attribute for the variables actually used to fix them, compile tested only. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add ras module ip block to amdgpu discoveryYiPeng Chai
Add ras module ip block to amdgpu discovery. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: check save count before RAS bad page savingTao Zhou
It's possible that unit_num is larger than 0 but save_count is zero, since we do get bad page address but the address is invalid. Check unit_num and save_count together. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: add the kernel docs for alloc/free/valid rangeSunil Khatri
Add kernel docs for the functions related to hmm_range. Documents added for functions: amdgpu_hmm_range_valid amdgpu_hmm_range_alloc amdgpu_hmm_range_free Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: use GPU_HDP_FLUSH for sriovVictor Zhao
Currently SRIOV runtime will use kiq to write HDP_MEM_FLUSH_CNTL for hdp flush. This register need to be write from CPU for nbif to aware, otherwise it will not work. Implement amdgpu_kiq_hdp_flush and use kiq to do gpu hdp flush during sriov runtime. v2: - fallback to amdgpu_asic_flush_hdp when amdgpu_kiq_hdp_flush failed - add function amdgpu_mes_hdp_flush v3: - changed returned error Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add kiq hdp flush callbacksVictor Zhao
Add kiq hdp flush callbacks for gfx ips to support gpu hdp flush when no ring presents Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amd: Add a helper to tell whether an IP block HW is enabledMario Limonciello
There is already a helper for telling if a block is valid, but if IP handling wants to check if it's HW is enabled no such helper exists. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Fix vram_usage underflowAlysa Liu
vram_usage was subtracting non-vram memory size, which caused it to become negative. Signed-off-by: Alysa Liu <Alysa.Liu@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Use memset32 for IB paddingTvrtko Ursulin
Use memset32 instead of open coding it, just because it is that bit nicer. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Add ras module eeprom safety watermark checkYiPeng Chai
Add ras module eeprom safety watermark check. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Avoid hive seqno increment in legacy rasYiPeng Chai
The hive->event_mgr variable is used by both ras module and legacy ras. To ensure the continuity of hive seqno growth, after enabling ras module, it is forbidden to operate the event_mgr variable in legacy ras. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Avoid loading bad pages into legacy rasYiPeng Chai
When ras module is enabled, the bad pages will be loaded by ras module. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: add ras module rma checkYiPeng Chai
Add ras module rma check. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Improve ras fatal error handling functionYiPeng Chai
In multi-gpu case, a fatal error will generate several fatal error interrupts. After improving this function, the ras module can reuse this function to only handle the first interrupt. V3: Initialize event_id using RAS_EVENT_INVALID_ID. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20drm/amdgpu: Intercept ras interrupts to ras moduleYiPeng Chai
Intercept ras interrupts to ras module. V2: Change function names in ras module. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-18drm/client: Remove holds_console_lock parameter from suspend/resumeThomas Zimmermann
No caller of the client resume/suspend helpers holds the console lock. The last such cases were removed from radeon in the patch series at [1]. Now remove the related parameter and the TODO items. v2: - update placeholders for CONFIG_DRM_CLIENT=n Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/series/151624/ # [1] Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Link: https://lore.kernel.org/r/20251001143709.419736-1-tzimmermann@suse.de
2025-10-13drm/amdgpu: update remove after reset flag for MES remove queueJonathan Kim
Remove queue after reset flag is required to remove a queue that has been successfully reset to clean up the MES' internal state. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Add ras module files into amdgpuYiPeng Chai
Add ras module files into amdgpu. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu/userqueue: validate userptrs for userqueuesSunil Khatri
userptrs could be changed by the user at any time and hence while locking all the bos before GPU start processing validate all the userptr bos. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: update the functions to use amdgpu version of hmmSunil Khatri
At times we need a bo reference for hmm and for that add a new struct amdgpu_hmm_range which will hold an optional bo member and hmm_range. Use amdgpu_hmm_range instead of hmm_range and let the bo as an optional argument for the caller if they want to the bo reference to be taken or they want to handle that explicitly. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Reserve discovery TMR only if neededLijo Lazar
For legacy SOCs, discovery binary is sideloaded. Instead of checking for binary blob, use a flag to determine if discovery region needs to be reserved. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Move reset-on-init sequence earlierLijo Lazar
Complete reset-on-init sequence before sysfs interfaces are created. Devices get properly initiaized only after reset, and then only sysfs interfaces should be made available. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Add amdgpu_discovery_infoLijo Lazar
Add amdgpu_discovery_info structure to keep all discovery related information. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: Reorganize sysfs ini/fini callsLijo Lazar
Aggregate sysfs ini/fini calls into separate functions. No functional change. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: clean up and unify hw fence handlingAlex Deucher
Decouple the amdgpu fence from the amdgpu_job structure. This lets us clean up the separate fence ops for the embedded fence and other fences. This also allows us to allocate the vm fence up front when we allocate the job. v2: Additional cleanup suggested by Christian v3: Additional cleanups suggested by Christian v4: Additional cleanups suggested by David and vm fence fix v5: cast seqno (David) Cc: David.Wu3@amd.com Cc: christian.koenig@amd.com Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu/userq: drop VCN and VPE doorbell handlingAlex Deucher
VCN and VPE userqs are not yet supported and this code is not correct. Userspace should provide the correct doorbell offset with in their doorbell page for the IP. Adjusting it here will not work as expected as userspace and the queue itself will have different offsets. We need to add a INFO IOCTL query to get the offset and range for each IP within the doorbell page to handle this properly. Cc: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Reviewed-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Pass userq suspend failures up to callerMario Limonciello
If a userq failed to suspend the rest of the suspend sequence may have problems. Pass the error code up to the caller for a decision on what to do. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Fix error handling with multiple userq IDRsMario Limonciello
If multiple userq IDR are in use and there is an error handling one at suspend or resume it will be silently discarded. Switch the suspend/resume() code to use guards and return immediately. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Pass IP suspend errors up to callersMario Limonciello
If IP suspend fails the callers should be notified so that they can potentially react. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Don't always set IP block HW status to falseMario Limonciello
amdgpu_device_ip_suspend_phase2() calls amdgpu_ip_block_suspend() which already sets HW block status to false when succeeding with IP suspend. Remove the explicit call in amdgpu_device_ip_suspend_phase2() so that the status is accurate. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Remove comment about handling errors in ↵Mario Limonciello
amdgpu_device_ip_suspend_phase1() Error handling was introduced in commit e095026f0066 ("drm/amdgpu: validate suspend before function call") so the comment about TODO is no longer needed. Fixes: e095026f0066 ("drm/amdgpu: validate suspend before function call") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Stop exporting amdgpu_device_ip_suspend() outside amdgpu_deviceMario Limonciello
amdgpu_device_ip_suspend() doesn't have a caller outside of amdgpu_device.c. Make it static. No intended functional changes. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amd: Unify shutdown() callback behaviorMario Limonciello
[Why] The shutdown() callback uses amdgpu_ip_suspend() which doesn't notify drm clients during shutdown. This could lead to hangs. [How] Change amdgpu_pci_shutdown() to call the same sequence as suspend/resume. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: validate userq va for GEM unmapPrike Liang
When a user unmaps a userq VA, the driver must ensure the queue has no in-flight jobs. If there is pending work, the kernel should wait for the attached eviction (bookkeeping) fence to signal before deleting the mapping. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: validate the queue va for resuming the queuePrike Liang
It requires validating the userq VA whether is mapped before trying to resume the queue. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: keeping waiting userq fence infinitelyPrike Liang
Keeping waiting the userq fence infinitely until hang detection, and then suspend the hang queue and set the fence error. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: track the userq bo va for its obj managementPrike Liang
Track the userq obj for its life time, and reference and dereference the buffer flag at its creating and destroying period. Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: add userq object va track helpersPrike Liang
Add the userq object virtual address list_add() helpers for tracking the userq obj va address usage. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: reduce queue timeout to 2 seconds v2Christian König
There has been multiple complains that 10 seconds are usually to long. The original requirement for longer timeout came from compute tests on AMDVLK, since that is no longer a topic reduce the timeout back to 2 seconds for all queues. While at it also remove any special handling for compute queues under SRIOV or pass through. v2: fix checkpatch warning. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu/mes: adjust the VMID masksAlex Deucher
The firmware limits the max vmid, but align the settings with the hw limits as well just to be safe. Reviewed-by: Shaoyun liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>