summaryrefslogtreecommitdiff
path: root/drivers/accel
AgeCommit message (Collapse)Author
9 daysaccel/amdxdna: Move RPM resume into job run functionLizhi Hou
Currently, amdxdna_pm_resume_get() is called during job creation, and amdxdna_pm_suspend_put() is called when the hardware notifies job completion. If a job is canceled before it is run, no hardware completion notification is generated, resulting in an unbalanced runtime PM resume/suspend pair. Fix this by moving amdxdna_pm_resume_get() to the job run path, ensuring runtime PM is only resumed for jobs that are actually executed. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260204171118.3165607-1-lizhi.hou@amd.com
9 daysaccel/amdxdna: Fix incorrect DPM level after suspend/resumeLizhi Hou
The suspend routine sets the DPM level to 0, which unintentionally overwrites the previously saved DPM level. As a result, the device always resumes with DPM level 0 instead of restoring the original value. Fix this by ensuring the suspend path does not overwrite the saved DPM level, allowing the correct DPM level to be restored during resume. Fixes: f4d7b8a6bc8c ("accel/amdxdna: Enhance power management settings") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260204171048.3165580-1-lizhi.hou@amd.com
10 daysaccel/amdxdna: Fix incorrect error code returned for failed chain commandLizhi Hou
The driver currently returns an incorrect error code when a chain command fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for failed chain commands. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com
10 daysaccel/amdxdna: Remove hardware context statusLizhi Hou
One newly supported command does not require hardware context configuration to be performed upfront. As a result, checking hardware context status causes this command to fail incorrectly. Remove hardware context status handling entirely. For other commands, if userspace submits a request without configuring the hardware context first, the firmware will report an error or time out as appropriate. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260202212450.2681273-1-lizhi.hou@amd.com
2026-01-30accel/amdxdna: Fix memory leak in amdxdna_ubuf_mapZishun Yi
The amdxdna_ubuf_map() function allocates memory for sg and internal sg table structures, but it fails to free them if subsequent operations (sg_alloc_table_from_pages or dma_map_sgtable) fail. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Min Ma <mamin506@gmail.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com
2026-01-30accel/amdxdna: Stop job scheduling across aie2_release_resource()Lizhi Hou
Running jobs on a hardware context while it is in the process of releasing resources can lead to use-after-free and crashes. Fix this by stopping job scheduling before calling aie2_release_resource() and restarting it after the release completes. Additionally, aie2_sched_job_run() now checks whether the hardware context is still active. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com
2026-01-30accel/amdxdna: Hold mm structure across iommu_sva_unbind_device()Lizhi Hou
Some tests trigger a crash in iommu_sva_unbind_device() due to accessing iommu_mm after the associated mm structure has been freed. Fix this by taking an explicit reference to the mm structure after successfully binding the device, and releasing it only after the device is unbound. This ensures the mm remains valid for the entire SVA bind/unbind lifetime. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com
2026-01-16Merge tag 'drm-misc-next-2026-01-15' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.20: Core Changes: - atomic: Introduce Gamma/Degamma LUT size check - gem: Fix a leak in drm_gem_get_unmapped_area - gpuvm: API sanitation for Rust bindings - panic: Few corner-cases fixes Driver Changes: - Replace system workqueue with percpu equivalent - amdxdna: Update message buffer allocation requirements, Update firmware version check - imagination: Add AM62P support - ivpu: Implement warm boot flow - rockchip: Get rid of atomic_check fixups, Add Rockchip RK3506 Support - rocket: Cleanups - bridge: - dw-hdmi-qp: Add support for HPD-less setups - panel: - mantix: Various power management related improvements - new panels: Innolux G150XGE-L05, - dma-buf: - cma: Call clear_page instead of memset Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20260115-lilac-dragon-of-opposition-ac0a30@houat
2026-01-14accel/amdxdna: Fix notifier_wq flushing warningLizhi Hou
Create notifier_wq with WQ_MEM_RECLAIM flag to fix the possible warning. workqueue: WQ_MEM_RECLAIM amdxdna_js:drm_sched_free_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM notifier_wq:0x0 Fixes: e486147c912f ("accel/amdxdna: Add BO import and export") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260113173624.256053-1-lizhi.hou@amd.com
2026-01-10accel/rocket: factor out code with find_core_for_dev in rocket_removeQuentin Schulz
There already is a function to return the offset of the core for a given struct device, so let's reuse that function instead of reimplementing the same logic. There's one change in behavior when a struct device is passed which doesn't match any core's. Before, we would continue through rocket_remove() but now we exit early, to match what other callers of find_core_for_dev() (rocket_device_runtime_resume/suspend()) are doing. This however should never happen. Aside from that, no intended change in behavior. Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-reuse-find-core-v1-1-be86a1d2734c@cherry.de
2026-01-10accel/rocket: fix unwinding in error path in rocket_probeQuentin Schulz
When rocket_core_init() fails (as could be the case with EPROBE_DEFER), we need to properly unwind by decrementing the counter we just incremented and if this is the first core we failed to probe, remove the rocket DRM device with rocket_device_fini() as well. This matches the logic in rocket_remove(). Failing to properly unwind results in out-of-bounds accesses. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-2-eec3bf29dc3b@cherry.de
2026-01-10accel/rocket: fix unwinding in error path in rocket_core_initQuentin Schulz
When rocket_job_init() is called, iommu_group_get() has already been called, therefore we should call iommu_group_put() and make the iommu_group pointer NULL. This aligns with what's done in rocket_core_fini(). If pm_runtime_resume_and_get() somehow fails, not only should rocket_job_fini() be called but we should also unwind everything done before that, that is, disable PM, put the iommu_group, NULLify it and then call rocket_job_fini(). This is exactly what's done in rocket_core_fini() so let's call that function instead of duplicating the code. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-1-eec3bf29dc3b@cherry.de
2026-01-08accel/amdxdna: Update firmware version check for latest firmwareLizhi Hou
The latest firmware increases the major version number. Update aie2_check_protocol() to accept and support the new firmware version. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251219014356.2234241-2-lizhi.hou@amd.com
2026-01-08accel/amdxdna: Update message DMA buffer allocationLizhi Hou
The latest firmware requires the message DMA buffer to - have a minimum size of 8K - use a power-of-two size - be aligned to the buffer size - not cross 64M boundary Update the buffer allocation logic to meet these requirements and support the latest firmware. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251219014356.2234241-1-lizhi.hou@amd.com
2026-01-08accel/ivpu: Implement warm boot flow for NPU6 and unify boot handlingKarol Wachowski
Starting from NPU6, the driver can pass boot parameters address through the AON retention register and toggle between cold/warm boot types using the boot_type parameter, while setting the cold boot entry point in both cases. Refactor the existing cold/warm boot handling to be consistent with the new NPU6 boot flow requirements and still maintain compatibility with older boot flows. This will allow firmware to remove support for legacy warm boot starting from NPU6. Fixes: 550f4dd2cedd ("accel/ivpu: Add support for Nova Lake's NPU") Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Link: https://patch.msgid.link/20251230142116.540026-1-maciej.falkowski@linux.intel.com
2025-12-26Merge tag 'drm-misc-next-2025-12-19' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.20: Core Changes: - dma-buf: Add tracepoints - sched: Introduce new helpers Driver Changes: - amdxdna: Enable hardware context priority, Remove (obsolete and never public) NPU2 Support, Race condition fix - rockchip: Add RK3368 HDMI Support - rz-du: Add RZ/V2H(P) MIPI-DSI Support - panels: - st7571: Introduce SPI support - New panels: Sitronix ST7920, Samsung LTL106HL02, LG LH546WF1-ED01, HannStar HSD156JUW2 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20251219-arcane-quaint-skunk-e383b0@houat
2025-12-26Merge tag 'drm-misc-next-2025-12-12' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.19: UAPI Changes: - panfrost: Add PANFROST_BO_SYNC ioctl - panthor: Add PANTHOR_BO_SYNC ioctl Core Changes: - atomic: Add drm_device pointer to drm_private_obj - bridge: Introduce drm_bridge_unplug, drm_bridge_enter, and drm_bridge_exit - dma-buf: Improve sg_table debugging - dma-fence: Add new helpers, and use them when needed - dp_mst: Avoid out-of-bounds access with VCPI==0 - gem: Reduce page table overhead with transparent huge pages - panic: Report invalid panic modes - sched: Add TODO entries - ttm: Various cleanups - vblank: Various refactoring and cleanups - Kconfig cleanups - Removed support for kdb Driver Changes: - amdxdna: Fix race conditions at suspend, Improve handling of zero tail pointers, Fix cu_idx being overwritten during command setup - ast: Support imported cursor buffers - - panthor: Enable timestamp propagation, Multiple improvements and fixes to improve the overall robustness, notably of the scheduler. - panels: - panel-edp: Support for CSW MNE007QB3-1, AUO B140HAN06.4, AUO B140QAX01.H Signed-off-by: Dave Airlie <airlied@redhat.com> [airlied: fix mm conflict] From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20251212-spectacular-agama-of-abracadabra-aaef32@penduick
2025-12-18accel/amdxdna: Enable hardware context priorityLizhi Hou
Newer firmware supports hardware context priority. Set the priority based on application input. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251217171719.2139025-1-lizhi.hou@amd.com
2025-12-18accel/amdxdna: Enable temporal sharing only modeLizhi Hou
Newer firmware versions prefer temporal sharing only mode. In this mode, the driver no longer needs to manage AIE array column allocation. Instead, a new field, num_unused_col, is added to the hardware context creation request to specify how many columns will not be used by this hardware context. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251217191150.2145937-1-lizhi.hou@amd.com
2025-12-18accel/amdxdna: Remove NPU2 supportLizhi Hou
NPU2 hardware was never publicly released and is now obsolete. Remove all remaining NPU2 support from the driver. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251217190818.2145781-1-lizhi.hou@amd.com
2025-12-17accel/amdxdna: Remove amdxdna_flush()Lizhi Hou
amdxdna_flush() was introduced to ensure that the device does not access a process address space after it has been freed. However, this is no longer necessary because the driver now increments the mm reference count when a command is submitted and decrements it only after the command has completed. This guarantees that the process address space remains valid for the entire duration of command execution. Remove amdxdna_flush to simplify the teardown path. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251216031311.2033399-1-lizhi.hou@amd.com
2025-12-16accel/ivpu: Validate scatter-gather size against buffer sizeKarol Wachowski
Validate scatter-gather table size matches buffer object size before mapping. Break mapping early if the table exceeds buffer size to prevent overwriting existing mappings. Also validate the table is not smaller than buffer size to avoid unmapped regions that trigger MMU translation faults. Log error and fail mapping operation on size mismatch to prevent data corruption from mismatched host memory locations and NPU addresses. Unmap any partially mapped buffer on failure. Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251215070933.520377-1-karol.wachowski@linux.intel.com
2025-12-15accel/amdxdna: Block running under a hypervisorMario Limonciello (AMD)
SVA support is required, which isn't configured by hypervisor solutions. Closes: https://github.com/QubesOS/qubes-issues/issues/10275 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4656 Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251213054513.87925-1-superm1@kernel.org Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-12-15accel/amdxdna: Fix potential NULL pointer dereference in context cleanupLizhi Hou
aie_destroy_context() is invoked during error handling in aie2_create_context(). However, aie_destroy_context() assumes that the context's mailbox channel pointer is non-NULL. If mailbox channel creation fails, the pointer remains NULL and calling aie_destroy_context() can lead to a NULL pointer dereference. In aie2_create_context(), replace aie_destroy_context() with a function which request firmware to remove the context created previously. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251212183244.1826318-1-lizhi.hou@amd.com
2025-12-12accel/amdxdna: Fix race where send ring appears full due to delayed head updateLizhi Hou
The firmware sends a response and interrupts the driver before advancing the mailbox send ring head pointer. As a result, the driver may observe the response and attempt to send a new request before the firmware has updated the head pointer. In this window, the send ring still appears full, causing the driver to incorrectly fail the send operation. This race can be triggered more easily in a multithreaded environment, leading to unexpected and spurious "send ring full" failures. To address this, poll the send ring head pointer for up to 100us before returning a full-ring condition. This allows the firmware time to update the head pointer. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251211045125.1724604-1-lizhi.hou@amd.com
2025-12-10accel/amdxdna: Fix cu_idx being cleared by memset() during command setupLizhi Hou
For one command type, cu_idx is assigned before calling memset() on the command structure. This results in cu_idx being overwritten, causing the firmware to receive an incomplete or invalid command and leading to unexpected command failures. Fix this by moving the memset() call before initializing cu_idx so that all fields are populated in the correct order. Fixes: 71829d7f2f70 ("accel/amdxdna: Use MSG_OP_CHAIN_EXEC_NPU when supported") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251209211639.1636888-1-lizhi.hou@amd.com
2025-12-09accel/amdxdna: Fix race condition when checking rpm_onLizhi Hou
When autosuspend is triggered, driver rpm_on flag is set to indicate that a suspend/resume is already in progress. However, when a userspace application submits a command during this narrow window, amdxdna_pm_resume_get() may incorrectly skip the resume operation because the rpm_on flag is still set. This results in commands being submitted while the device has not actually resumed, causing unexpected behavior. The set_dpm() is called by suspend/resume, it relied on rpm_on flag to avoid calling into rpm suspend/resume recursivly. So to fix this, remove the use of the rpm_on flag entirely. Instead, introduce aie2_pm_set_dpm() which explicitly resumes the device before invoking set_dpm(). With this change, set_dpm() is called directly inside the suspend or resume execution path. Otherwise, aie2_pm_set_dpm() is called. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251208165356.1549237-1-lizhi.hou@amd.com
2025-12-08accel/amdxdna: Fix tail-pointer polling in mailbox_get_msg()Lizhi Hou
In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a non-zero tail pointer. This assumed that a zero value indicates an error. However, certain corner cases legitimately produce a zero tail pointer. To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail pointer will be treated as a valid rewind event. Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251204181603.793824-1-lizhi.hou@amd.com
2025-12-02accel/amdxdna: Poll MPNPU_PWAITMODE after requesting firmware suspendLizhi Hou
After issuing a firmware suspend request, the driver must ensure that the suspend operation has completed before proceeding. Add polling of the MPNPU_PWAITMODE register to confirm that the firmware has fully entered the suspended state. This prevents race conditions where subsequent operations assume the firmware is idle before it has actually completed its suspend sequence. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251202165427.507414-1-lizhi.hou@amd.com
2025-11-13accel/amdxdna: Fix deadlock between context destroy and job timeoutLizhi Hou
Hardware context destroy function holds dev_lock while waiting for all jobs to complete. The timeout job also needs to acquire dev_lock, this leads to a deadlock. Fix the issue by temporarily releasing dev_lock before waiting for all jobs to finish, and reacquiring it afterward. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181050.1293125-1-lizhi.hou@amd.com
2025-11-13accel/amdxdna: Clear mailbox interrupt register during channel creationLizhi Hou
The mailbox interrupt register is not always cleared when a mailbox channel is created. This can leave stale interrupt states from previous operations. Fix this by explicitly clearing the interrupt register in the mailbox channel creation function. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181115.1293158-1-lizhi.hou@amd.com
2025-11-12accel/ivpu: Fix warning due to undefined CONFIG_PROC_FSKarol Wachowski
Change #if to #ifdef CONFIG_PROC_FS to fix warning reported by test robot: drivers/accel/ivpu/ivpu_drv.c:458:5: warning: "CONFIG_PROC_FS" is not defined, evaluates to 0 [-Wundef] Fixes: 63cc028484ab ("accel/ivpu: Add fdinfo support for memory statistics") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Reviewed-by: Andrzej.Kacprowski@linux.intel.com Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251112071911.1136934-1-karol.wachowski@linux.intel.com
2025-11-12accel/ivpu: Count only resident buffers in memory utilizationKarol Wachowski
Do not count buffer objects that have no backing pages, including imported buffers where pages are set by VM faults triggered by userspace or pinned by other drivers. Instead, return information about actual memory used by the NPU. Counting imported buffers results in incorrect calculations when the same pages are counted multiple times, giving overly high results. Fixes: 7bfc9fa99580 ("accel/ivpu: Expose NPU memory utilization info in sysfs") Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251106101052.1050348-3-karol.wachowski@linux.intel.com
2025-11-12accel/ivpu: Add fdinfo support for memory statisticsKarol Wachowski
Implement DRM fdinfo interface to expose memory usage statistics for NPU device file descriptors. Exclude unpinned and imported buffers from resident memory calculations to provide accurate memory usage reporting. Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251106101052.1050348-2-karol.wachowski@linux.intel.com
2025-11-07accel/qaic: Add qaic_ prefix to irq_polling_workZack McKevitt
Rename irq_polling_work to qaic_irq_polling_work to reduce ambiguity and avoid potential naming conflicts in the future. Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031192511.3179130-1-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Collect crashdump from SSR channelPranjal Ramajor Asha Kanojiya
After subsystem of the device has crashed it sends a message with command DEBUG_TRANSFER_INFO to kernel(host). Send ACK for that message and then prepare to collect the ramdump of the subsystem Steps of crashdump collection is as follows, 1) Device sends DEBUG_TRANSFER_INFO message indicating that device wants to send crashdump. 2) Send an acknowledgment to that message either ACK or NACK. a) NACK will inform the device that host will not download the crashdump b) ACK will inform the device that host will download the crashdump 3) Along with the DEBUG_TRANSFER_INFO we receive a table base address and its length, use that to download that table from device. a) This table is meta data of the crashdump and not the actual crashdump. 4) After we respond as ACK for message received on step 1) we start downloading the table. Use series of MEMORY_READ/MEMORY_READ_RSP SSR commands to download the entire table. 5) Each entry in the table represents a segment of crashdump. Once the table downloading is complete, iterate through each entry of table and download each crashdump segment(same as table itself). Table entry contains the memory base address and length along with other info. 6) After the entire crashdump is downloaded send DEBUG_TRANSFER_DONE which marks that host is terminating the crashdump transfer. This message can be send in both success or error case. 7) After receiving DEBUG_TRANSFER_DONE_RSP hand over the crashdump to dev_coredumpv() and free all the necessary memory. Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Co-developed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <pkanojiy@codeaurora.org> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-4-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Implement basic SSR handlingJeffrey Hugo
Subsystem restart (SSR) for a qaic device means that a NSP has crashed, and will be restarted. However the restart process will lose any state associated with activation, so the user will need to do some recovery. While SSR has the provision to collect a crash dump, this patch does not implement support for it. Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Co-developed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Co-developed-by: Troy Hanson <quic_thanson@quicinc.com> Signed-off-by: Troy Hanson <quic_thanson@quicinc.com> Co-developed-by: Aswin Venkatesan <aswivenk@qti.qualcomm.com> Signed-off-by: Aswin Venkatesan <aswivenk@qti.qualcomm.com> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> [jhugo: Fix minor checkpatch whitespace issues] Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-3-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Add DMA Bridge Channel(DBC) sysfs and ueventsPranjal Ramajor Asha Kanojiya
Expose sysfs files for each DBC representing the current state of that DBC. For example, sysfs for DBC ID 0 and accel minor number 0 looks like this, /sys/class/accel/accel0/dbc0_state Following are the states and their corresponding values, DBC_STATE_IDLE (0) DBC_STATE_ASSIGNED (1) DBC_STATE_BEFORE_SHUTDOWN (2) DBC_STATE_AFTER_SHUTDOWN (3) DBC_STATE_BEFORE_POWER_UP (4) DBC_STATE_AFTER_POWER_UP (5) Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-2-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/amdxdna: Treat power-off failure as unrecoverable errorLizhi Hou
Failing to set power off indicates an unrecoverable hardware or firmware error. Update the driver to treat such a failure as a fatal condition and stop further operations that depend on successful power state transition. This prevents undefined behavior when the hardware remains in an unexpected state after a failed power-off attempt. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251106180521.1095218-1-lizhi.hou@amd.com
2025-11-06accel/amdxdna: Fix dma_fence leak when job is canceledLizhi Hou
Currently, dma_fence_put(job->fence) is called in job notification callback. However, if a job is canceled, the notification callback is never invoked, leading to a memory leak. Move dma_fence_put(job->fence) to the job cleanup function to ensure the fence is always released. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com
2025-11-05accel/qaic: Add support for PM callbacksYoussef Samir
Add initial support for suspend and hibernation PM callbacks to QAIC. The device can be suspended any time in which the data path is not busy as queued I/O operations are lost on suspension and cannot be resumed after suspend. Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251029181808.1216466-1-zachary.mckevitt@oss.qualcomm.com
2025-11-05accel/amdxdna: Support preemption requestsLizhi Hou
The driver checks the firmware version during initialization.If preemption is supported, the driver configures preemption accordingly and handles userspace preemption requests. Otherwise, the driver returns an error for userspace preemption requests. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251104185340.897560-1-lizhi.hou@amd.com
2025-11-05accel/ivpu: Improve debug and warning messagesKarol Wachowski
Add IOCTL debug bit for logging user provided parameter validation errors. Refactor several warning and error messages to better reflect fault reason. User generated faults should not flood kernel messages with warnings or errors, so change those to ivpu_dbg(). Add additional debug logs for parameter validation in IOCTLs. Check size provided by in metric streamer start and return -EINVAL together with a debug message print. Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251104132418.970784-1-karol.wachowski@linux.intel.com
2025-11-04accel/amdxdna: Add IOCTL parameter for telemetry dataLizhi Hou
Extend DRM_IOCTL_AMDXDNA_GET_INFO to include additional parameters that allow collection of telemetry data. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251104062546.833771-3-lizhi.hou@amd.com
2025-11-04accel/amdxdna: Add IOCTL parameter for resource dataLizhi Hou
Extend DRM_IOCTL_AMDXDNA_GET_INFO to include additional parameters that allow collection of resource data. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251104062546.833771-2-lizhi.hou@amd.com
2025-11-04accel/amdxdna: Add hardware specific attributesLizhi Hou
Add three hardware specific attributes to describe device capabilities: hwctx_limit: The maximum number of hardware context supported. max_tops: The maximum TOPS supported. curr_tops: The TOPS achievable with the current power and frequency configuration. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251104062546.833771-1-lizhi.hou@amd.com
2025-11-03accel/amdxdna: Use MSG_OP_CHAIN_EXEC_NPU when supportedLizhi Hou
MSG_OP_CHAIN_EXEC_NPU is a unified mailbox message that replaces MSG_OP_CHAIN_EXEC_BUFFER_CF and MSG_OP_CHAIN_EXEC_DPU. Add driver logic to check firmware version, and if MSG_OP_CHAIN_EXEC_NPU is supported, uses it to submit firmware commands. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251031014700.2919349-1-lizhi.hou@amd.com
2025-10-31drm: include drm_print.h where neededJani Nikula
There are a gazillion files that depend on drm_print.h being indirectly included via drm_buddy.h, drm_mm.h, or ttm/ttm_resource.h. In preparation for removing those includes, explicitly include drm_print.h where needed. Cc: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://lore.kernel.org/r/5fe67395907be33eb5199ea6d540e29fddee71c8.1761734313.git.jani.nikula@intel.com
2025-10-30accel/amdxdna: Fix incorrect command state for timed out jobLizhi Hou
When a command times out, mark it as ERT_CMD_STATE_TIMEOUT. Any other commands that are canceled due to this timeout should be marked as ERT_CMD_STATE_ABORT. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251029193423.2430463-1-lizhi.hou@amd.com
2025-10-30accel/ivpu: Wait for CDYN de-assertion during power down sequenceKarol Wachowski
During power down, pending DVFS operations may still be in progress when the NPU reset is asserted after CDYN=0 is set. Since the READY bit may already be deasserted at this point, checking only the READY bit is insufficient to ensure all transactions have completed. Add an explicit check for CDYN de-assertion after the READY bit check to guarantee no outstanding transactions remain before proceeding. Fixes: 550f4dd2cedd ("accel/ivpu: Add support for Nova Lake's NPU") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251030091700.293341-1-karol.wachowski@linux.intel.com