path: root/drivers/iommu/intel
Age    Commit message    Author
39 hours    Convert 'alloc_obj' family to use the new default GFP_KERNEL argument    (Linus Torvalds)
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
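As an illustration of what the rewrite does (the structure and variable names below are invented for the example, not taken from the actual diff), a call with a plain GFP_KERNEL argument simply loses that argument and relies on the new default:

    /* before: explicit GFP_KERNEL argument */
    ctx = kzalloc_obj(struct my_ctx, GFP_KERNEL);   /* 'struct my_ctx' is a made-up type */

    /* after: the default-GFP form the sed rewrite produces */
    ctx = kzalloc_obj(struct my_ctx);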
2 days    treewide: Replace kmalloc with kmalloc_obj for non-scalar types    (Kees Cook)
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances:

  Single allocations:
      kmalloc(sizeof(TYPE), ...)
  are replaced with:
      kmalloc_obj(TYPE, ...)

  Array allocations:
      kmalloc_array(COUNT, sizeof(TYPE), ...)
  are replaced with:
      kmalloc_objs(TYPE, COUNT, ...)

  Flex array allocations:
      kmalloc(struct_size(PTR, FAM, COUNT), ...)
  are replaced with:
      kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
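A hedged before/after illustration of the single-object case (the structure name is invented; the point is that the typed return value lets the compiler catch mismatched assignments):

    struct intel_foo *p;                                  /* hypothetical structure            */

    p = kmalloc(sizeof(struct intel_foo), GFP_KERNEL);    /* before: returns void *            */
    p = kmalloc_obj(struct intel_foo, GFP_KERNEL);        /* after: returns struct intel_foo * */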
12 days    Merge tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux    (Linus Torvalds)
Pull iommu updates from Joerg Roedel:

 "Core changes:
   - Rust bindings for IO-pgtable code
   - IOMMU page allocation debugging support
   - Disable ATS during PCI resets

  Intel VT-d changes:
   - Skip dev-iotlb flush for inaccessible PCIe device
   - Flush cache for PASID table before using it
   - Use right invalidation method for SVA and NESTED domains
   - Ensure atomicity in context and PASID entry updates

  AMD-Vi changes:
   - Support for nested translations
   - Other minor improvements

  ARM-SMMU-v2 changes:
   - Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"

  ARM-SMMU-v3 changes:
   - Improve CMDQ locking fairness for pathetically small queue sizes
   - Remove tracking of the IAS as this is only relevant for AArch32 and was causing C_BAD_STE errors
   - Add device-tree support for NVIDIA's CMDQV extension
   - Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
   - Don't disable ATS for nested S1-bypass nested domains
   - Additions to the kunit selftests"

* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
  iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
  iommu/amd: serialize sequence allocation under concurrent TLB invalidations
  iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
  iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
  iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
  iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
  iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
  iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
  iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
  iommu/tegra241-cmdqv: Decouple driver from ACPI
  iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
  iommu/vt-d: Fix race condition during PASID entry replacement
  iommu/vt-d: Clear Present bit before tearing down context entry
  iommu/vt-d: Clear Present bit before tearing down PASID entry
  iommu/vt-d: Flush piotlb for SVM and Nested domain
  iommu/vt-d: Flush cache for PASID table before using it
  iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  rust: iommu: fix `srctree` link warning
  rust: iommu: fix Rust formatting
  ...
13 days    Merge tag 'x86-irq-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip    (Linus Torvalds)
Pull x86 irq updates from Thomas Gleixner:

 "Trivial cleanups for the posted MSI interrupt handling"

* tag 'x86-irq-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq_remapping: Sanitize posted_msi_supported()
  x86/irq: Cleanup posted MSI code
2026-02-06    Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and 'core' into next    (Joerg Roedel)
2026-02-06    iommu/vt-d: Treat PAGE_SNOOP and PWSNP separately    (Viktor Kleen)
The PASID_FLAG_PAGE_SNOOP and PASID_FLAG_PWSNP constants are identical. This will cause the pasid code to always set both or neither of the PGSNP and PWSNP bits in PASID table entries. However, PWSNP is a reserved bit if SMPWC is not set in the IOMMU's extended capability register, even if SC is supported. This has resulted in DMAR errors when testing the iommufd code on an Arrow Lake platform. With this patch, those errors disappear and the PASID table entries look correct. Fixes: 101a2854110fa ("iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry") Cc: stable@vger.kernel.org Signed-off-by: Viktor Kleen <viktor@kleen.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20260202192109.1665799-1-viktor@kleen.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-22    iommu/vt-d: Fix race condition during PASID entry replacement    (Lu Baolu)
The Intel VT-d PASID table entry is 512 bits (64 bytes). When replacing an active PASID entry (e.g., during domain replacement), the current implementation calculates a new entry on the stack and copies it to the table using a single structure assignment.

    struct pasid_entry *pte, new_pte;

    pte = intel_pasid_get_entry(dev, pasid);
    pasid_pte_config_first_level(iommu, &new_pte, ...);
    *pte = new_pte;

Because the hardware may fetch the 512-bit PASID entry in multiple 128-bit chunks, updating the entire entry while it is active (Present bit set) risks a "torn" read. In this scenario, the IOMMU hardware could observe an inconsistent state (partially new data, partially old data), leading to unpredictable behavior or spurious faults.

Fix this by removing the unsafe "replace" helpers and following the "clear-then-update" flow, which ensures the Present bit is cleared and the required invalidation handshake is completed before the new configuration is applied.

Fixes: 7543ee63e811 ("iommu/vt-d: Add pasid replace helpers")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20260120061816.2132558-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
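A minimal sketch of the clear-then-update ordering described above; intel_pasid_get_entry() and dma_wmb() are real, while pasid_clear_present() and flush_pasid_and_dev_tlb() are simplified stand-ins for the driver's actual helpers:

    pte = intel_pasid_get_entry(dev, pasid);

    pasid_clear_present(pte);          /* stand-in: clear only the P bit of the live entry */
    dma_wmb();                         /* make the cleared P bit visible to the IOMMU      */
    flush_pasid_and_dev_tlb(iommu);    /* stand-in for the required invalidation handshake */

    /* Only now write the new configuration into the (non-present) entry,
     * e.g. via pasid_pte_config_first_level(), and set P last.            */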
2026-01-22    iommu/vt-d: Clear Present bit before tearing down context entry    (Lu Baolu)
When tearing down a context entry, the current implementation zeros the entire 128-bit entry using multiple 64-bit writes. This creates a window where the hardware can fetch a "torn" entry, where some fields are already zeroed while the 'Present' bit is still set, leading to unpredictable behavior or spurious faults. While x86 provides strong write ordering, the compiler may reorder writes to the two 64-bit halves of the context entry. Even without compiler reordering, the hardware fetch is not guaranteed to be atomic with respect to multiple CPU writes.

Align with the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake:

 1. Clear only the 'Present' (P) bit of the context entry first to signal the transition of ownership from hardware to software.
 2. Use dma_wmb() to ensure the cleared bit is visible to the IOMMU.
 3. Perform the required cache and context-cache invalidation to ensure hardware no longer has cached references to the entry.
 4. Fully zero out the entry only after the invalidation is complete.

Also, add a dma_wmb() to context_set_present() to ensure the entry is fully initialized before the 'Present' bit becomes visible.

Fixes: ba39592764ed2 ("Intel IOMMU: Intel IOMMU driver")
Reported-by: Dmytro Maluka <dmaluka@chromium.org>
Closes: https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Dmytro Maluka <dmaluka@chromium.org>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20260120061816.2132558-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
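A rough sketch of that four-step handshake for a context entry; the driver's context entry really is a pair of u64 words with P in bit 0 of the low word, but invalidate_context_entry() is an invented stand-in for the actual invalidation calls:

    /* 1. Clear only the Present bit; the rest of the entry stays intact.       */
    context->lo &= ~1ULL;                           /* bit 0 of the low qword is P */
    /* 2. Make the cleared P bit visible to the IOMMU before invalidating.      */
    dma_wmb();
    /* 3. Invalidate the context-cache (and dependent caches) for this entry.   */
    invalidate_context_entry(iommu, bus, devfn);    /* stand-in for the real flush */
    /* 4. Only after the invalidation completes, zero the whole 128-bit entry.  */
    context->lo = 0;
    context->hi = 0;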
2026-01-22    iommu/vt-d: Clear Present bit before tearing down PASID entry    (Lu Baolu)
The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64 bytes). When tearing down an entry, the current implementation zeros the entire 64-byte structure immediately using multiple 64-bit writes. Since the IOMMU hardware may fetch these 64 bytes using multiple internal transactions (e.g., four 128-bit bursts), updating or zeroing the entire entry while it is active (P=1) risks a "torn" read. If a hardware fetch occurs simultaneously with the CPU zeroing the entry, the hardware could observe an inconsistent state, leading to unpredictable behavior or spurious faults.

Follow the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake:

 1. Clear only the 'Present' (P) bit of the PASID entry.
 2. Use a dma_wmb() to ensure the cleared bit is visible to hardware before proceeding.
 3. Execute the required invalidation sequence (PASID cache, IOTLB, and Device-TLB flush) to ensure the hardware has released all cached references.
 4. Only after the flushes are complete, zero out the remaining fields of the PASID entry.

Also, add a dma_wmb() in pasid_set_present() to ensure that all other fields of the PASID entry are visible to the hardware before the Present bit is set.

Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Dmytro Maluka <dmaluka@chromium.org>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20260120061816.2132558-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-22    iommu/vt-d: Flush piotlb for SVM and Nested domain    (Yi Liu)
Besides the paging domains that use the first stage (FS), SVM and Nested domains need to use the piotlb invalidation descriptor as well.

Fixes: b33125296b50 ("iommu/vt-d: Create unique domain ops for each stage")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20251223065824.6164-1-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-22    iommu/vt-d: Flush cache for PASID table before using it    (Dmytro Maluka)
When writing the address of a freshly allocated, zero-initialized PASID table to a PASID directory entry, do so after the CPU cache flush for this PASID table, not before it. This avoids the time window in which the PASID table may already be used by non-coherent IOMMU hardware while its contents in RAM are still random old data rather than zeros.

Fixes: 194b3348bdbb ("iommu/vt-d: Fix PASID directory pointer coherency")
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20251221123508.37495-1-dmaluka@chromium.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-22    iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode    (Jinhui Guo)
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system. For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase. Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap:

 1. mm-struct release
 2. {attach,release}_dev
 3. set/remove PASID
 4. dirty-tracking setup

The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions.

Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
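A minimal sketch of the check described above, placed in front of the dev-TLB flush; the surrounding variables are illustrative rather than copied from the driver, while pci_device_is_present() and qi_flush_dev_iotlb() are the real interfaces named in the commit:

    struct pci_dev *pdev = to_pci_dev(dev);

    /* Skip the ATS invalidation if the device can no longer respond, e.g.
     * after a surprise removal or a link fault; otherwise qi_submit_sync()
     * would wait forever for a completion that never arrives.             */
    if (!pci_device_is_present(pdev))
        return;

    qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, addr, mask);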
2026-01-22    iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode    (Jinhui Guo)
PCIe endpoints with ATS enabled and passed through to userspace (e.g., QEMU, DPDK) can hard-lock the host when their link drops, either by surprise removal or by a link fault. Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") adds pci_dev_is_disconnected() to devtlb_invalidation_with_pasid() so ATS invalidation is skipped only when the device is being safely removed, but it applies only when Intel IOMMU scalable mode is enabled. With scalable mode disabled or unsupported, a system hard-lock occurs when a PCIe endpoint's link drops because the Intel IOMMU waits indefinitely for an ATS invalidation that cannot complete.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(), which calls qi_flush_dev_iotlb() and can also hard-lock the system when a PCIe endpoint's link drops.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 intel_context_flush_no_pasid
 device_pasid_table_teardown
 pci_pasid_table_teardown
 pci_for_each_dma_alias
 intel_pasid_teardown_sm_context
 intel_iommu_release_device
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Sometimes the endpoint loses connection without a link-down event (e.g., due to a link fault); killing the process (virsh destroy) then hard-locks the host.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

pci_dev_is_disconnected() only covers safe-removal paths; pci_device_is_present() tests accessibility by reading vendor/device IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs. Since __context_flush_dev_iotlb() is only called on {attach,release}_dev paths (not hot), add pci_device_is_present() there to skip inaccessible devices and avoid the hard-lock.

Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Link: https://lore.kernel.org/r/20251211035946.2071-2-guojinhui.liam@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-12-18    x86/irq_remapping: Sanitize posted_msi_supported()    (Thomas Gleixner)
posted_msi_supported() is a misnomer as it actually checks whether it is enabled or not. Aside from that, it does not take CONFIG_X86_POSTED_MSI into account, which is required to actually use it. Rename it to posted_msi_enabled() and make the return value depend on CONFIG_X86_POSTED_MSI, which allows the compiler to eliminate the related dead code and data if disabled:

    text    data    bss     dec     hex   filename
   10046     701   3296   14043    36db   drivers/iommu/intel/irq_remapping.o
    9904     413   3296   13613    352d   drivers/iommu/intel/irq_remapping.o

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251125214631.170499997@linutronix.de
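A plausible shape for the renamed helper, gating the runtime check on the config option so the compiler can drop dependent code when it is compiled out; this is a sketch of the described approach, and posted_msi_is_active() is an invented stand-in for whatever runtime check the driver actually performs:

    static inline bool posted_msi_enabled(void)
    {
        /* Without CONFIG_X86_POSTED_MSI the feature can never be used, so the
         * compiler sees a constant 'false' and eliminates dead code and data. */
        return IS_ENABLED(CONFIG_X86_POSTED_MSI) && posted_msi_is_active();
    }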
2025-12-17    x86/msi: Make irq_retrigger() functional for posted MSI    (Thomas Gleixner)
Luigi reported that retriggering a posted MSI interrupt does not work correctly. The reason is that the retrigger happens at the vector domain by sending an IPI to the actual vector on the target CPU. That works correctly exactly once because the posted MSI interrupt chip does not issue an EOI as that's only required for the posted MSI notification vector itself. As a consequence the vector becomes stale in the ISR, which not only affects this vector but also any lower priority vector in the affected APIC because the ISR bit is not cleared.

Luigi proposed to set the vector in the remap PIR bitmap and raise the posted MSI notification vector. That works, but that still does not cure a related problem: If there is ever a stray interrupt on such a vector, then the related APIC ISR bit becomes stale due to the lack of EOI as described above. Unlikely to happen, but if it happens it's not debuggable at all.

So instead of playing games with the PIR, this can be actually solved for both cases by:

 1) Keeping track of the posted interrupt vector handler state

 2) Implementing a posted MSI specific irq_ack() callback which checks that state. If the posted vector handler is inactive it issues an EOI, otherwise it delegates that to the posted handler.

This is correct versus affinity changes and concurrent events on the posted vector as the actual handler invocation is serialized through the interrupt descriptor lock.

Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Reported-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Luigi Rizzo <lrizzo@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251125214631.044440658@linutronix.de
Closes: https://lore.kernel.org/lkml/20251124104836.3685533-1-lrizzo@google.com
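A sketch of the ack-delegation idea in point 2); the per-CPU state flag and the callback name are invented here for illustration and do not reflect the exact patch:

    static DEFINE_PER_CPU(bool, posted_msi_handler_active);   /* assumed state flag */

    static void posted_msi_irq_ack(struct irq_data *data)
    {
        /* If the posted-MSI notification handler is running on this CPU, it
         * issues the single EOI for the whole batch; otherwise ack this
         * vector now so its ISR bit does not go stale.                      */
        if (!__this_cpu_read(posted_msi_handler_active))
            apic_eoi();
    }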
2025-12-05    Merge tag 'soc-drivers-6.19' of ↵    (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC driver updates from Arnd Bergmann: "This is the first half of the driver changes: - A treewide interface change to the "syscore" operations for power management, as a preparation for future Tegra specific changes - Reset controller updates with added drivers for LAN969x, eic770 and RZ/G3S SoCs - Protection of system controller registers on Renesas and Google SoCs, to prevent trivially triggering a system crash from e.g. debugfs access - soc_device identification updates on Nvidia, Exynos and Mediatek - debugfs support in the ST STM32 firewall driver - Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI - Cleanups for memory controller support on Nvidia and Renesas" * tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits) memory: tegra186-emc: Fix missing put_bpmp Documentation: reset: Remove reset_controller_add_lookup() reset: fix BIT macro reference reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe reset: th1520: Support reset controllers in more subsystems reset: th1520: Prepare for supporting multiple controllers dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets reset: remove legacy reset lookup code clk: davinci: psc: drop unused reset lookup reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support reset: eswin: Add eic7700 reset driver dt-bindings: reset: eswin: Documentation for eic7700 SoC reset: sparx5: add LAN969x support dt-bindings: reset: microchip: Add LAN969x support soc: rockchip: grf: Add select correct PWM implementation on RK3368 soc/tegra: pmc: Add USB wake events for Tegra234 amba: tegra-ahb: Fix device leak on SMMU enable ...
2025-11-28    Merge branches 'arm/smmu/updates', 'arm/smmu/bindings', 'mediatek', 'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next    (Joerg Roedel)
2025-11-28    iommupt/vtd: Support mgaw's less than a 4 level walk for first stage    (Jason Gunthorpe)
If the IOVA is limited to less than 48 bits the page table will be constructed with a 3 level configuration which is unsupported by hardware. Like the second stage, the caller needs to pass in both the top_level and the vasz to specify a table that has more levels than required to hold the IOVA range.

Fixes: 6cbc09b7719e ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28    iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires    (Jason Gunthorpe)
VT-d second stage HW specifies both the maximum IOVA and the supported table walk starting points. Weirdly there is HW that only supports a 4 level walk but has a maximum IOVA that only needs 3. The current code miscalculates this and creates a wrongly sized page table which ultimately fails the compatibility check for number of levels. This is fixed by allowing the page table to be created with both a vasz and top_level input. The vasz will set the aperture for the domain while the top_level will set the page table geometry. Add top_level to vtdss and correct the logic in VT-d to generate the right top_level and vasz from mgaw and sagaw. Fixes: d373449d8e97 ("iommu/vt-d: Use the generic iommu page table") Reported-by: Calvin Owens <calvin@wbinvd.org> Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20    iommu/vt-d: Restore previous domain::aperture_end calculation    (Lu Baolu)
Commit d373449d8e97 ("iommu/vt-d: Use the generic iommu page table") changed the calculation of domain::aperture_end. Previously, it was calculated as:

    domain->domain.geometry.aperture_end = __DOMAIN_MAX_ADDR(domain->gaw - 1);

where domain->gaw was limited to less than MGAW. Currently, it is calculated purely based on the max level of the page table that the hardware supports. This is incorrect as stated in Section 3.6 of the VT-d spec:

  "Software using first-stage translation structures to translate an IO Virtual Address (IOVA) must use canonical addresses. Additionally, software must limit addresses to less than the minimum of MGAW and the lower canonical address width implied by FSPM (i.e., 47-bit when FSPM is 4-level and 56-bit when FSPM is 5-level)."

Restore the previous calculation method for domain::aperture_end to avoid violating the spec. Incorrect aperture calculation causes GPU hangs without generating VT-d faults on some Intel client platforms.

Fixes: d373449d8e97 ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/4f15cf3b-6fad-4cd8-87e5-6d86c0082673@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20    iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb    (Aashish Sharma)
Invalidation hint (ih) in the function 'qi_desc_iotlb' is initialized to zero and never used. It is embedded in the 0th bit of the 'addr' parameter. Get the correct 'ih' value from there. Fixes: f701c9f36bcb ("iommu/vt-d: Factor out invalidation descriptor composition") Signed-off-by: Aashish Sharma <aashish@aashishsharma.net> Link: https://lore.kernel.org/r/20251009010903.1323979-1-aashish@aashishsharma.net Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
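A minimal sketch of what "get the ih from bit 0 of addr" means when composing the descriptor; the field helpers are in the style the driver uses, but treat the snippet as illustrative rather than the patch itself:

    u64 ih = addr & 1;      /* the caller encodes the invalidation hint in bit 0 of addr */

    desc->qw1 = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) | QI_IOTLB_AM(size_order);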
2025-11-20    iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD    (Vineeth Pillai (Google))
The INTEL_IOMMU_FLOPPY_WA workaround was introduced to create direct mappings for the first 16MB for floppy devices, as the floppy driver was not using the DMA API. We need not do this direct map if the floppy driver is not enabled.

INTEL_IOMMU_FLOPPY_WA is generally not a good idea. The IOMMU will be mapping pages in this address range while the kernel may also be allocating from this range (mostly under memory stress). A misbehaving device using this domain will have access to pages that the kernel might be actively using. We noticed this while running a test that was trying to figure out whether any pages used by the kernel are present in IOMMU page tables.

This patch reduces the scope of the above issue by disabling the workaround when the floppy driver is not enabled. But we would still need to fix the floppy driver to use the DMA API so that we need not do the direct map without reserving the pages. The other option is to reserve this memory range in firmware so that the kernel will not use the pages.

Fixes: d850c2ee5fe2 ("iommu/vt-d: Expose ISA direct mapping region via iommu_get_resv_regions")
Fixes: 49a0429e53f2 ("Intel IOMMU: Iommu floppy workaround")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Link: https://lore.kernel.org/r/20251002161625.1155133-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-14    syscore: Pass context data to callbacks    (Thierry Reding)
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-05    iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry    (Jason Gunthorpe)
Currently an incoherent walk domain cannot be attached to a coherent capable iommu. Kevin says HW probably doesn't exist with such a mixture, but making the driver support it makes logical sense anyhow.

When building the PASID entry the PWSNP (Page Walk Snoop) bit tells the HW if it should issue snoops. If the page table is cache flushed because of PT_FEAT_DMA_INCOHERENT then it is fine to set this bit to 0 even if the HW supports 1.

Weaken the compatibility check to permit a coherent instance to accept an incoherent table and fix the PASID table construction to set PWSNP from PT_FEAT_DMA_INCOHERENT. SVA always sets PWSNP.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05    iommu/vt-d: Use the generic iommu page table    (Jason Gunthorpe)
Replace the VT-d iommu_domain implementation of the VT-d second stage and first stage page tables with the iommupt VTDSS and x86_64 pagetables. x86_64 is shared with the AMD driver.

There are a couple notable things in VT-d:

 - Like AMD the second stage format is not sign extended, unlike AMD it cannot decode a full 64 bits. The first stage format is a normal sign extended x86 page table

 - The HW caps can indicate how many levels, how many address bits and what leaf page sizes are supported in HW. As before the highest number of levels that can translate the entire supported address width is used. The supported page sizes are adjusted directly from the dedicated first/second stage cap bits.

 - VTD requires flushing 'write buffers'. This logic is left unchanged, the write buffer flushes on any gather flush or through iotlb_sync_map.

 - Like ARM, VTD has an optional non-coherent page table walker that requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT the same as ARM, however x86 can't use the DMA API for flush, it must call the arch function clflush_cache_range()

 - The PT_FEAT_DYNAMIC_TOP can probably be supported on VT-d someday for the second stage when it uses 128 bit atomic stores for the HW context structures.

 - PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17

 - A kernel command line parameter "sp_off" disables all page sizes except 4k

Remove all the unused iommu_domain page table code. The debugfs paths have their own independent page table walker that is left alone for now.

This corrects a race with the non-coherent walker that the ARM implementations have fixed:

    CPU 0                                CPU 1
    pfn_to_dma_pte()                     pfn_to_dma_pte()
      pte = &parent[offset];
      if (!dma_pte_present(pte)) {
        try_cmpxchg64(&pte->val)
                                           pte = &parent[offset];
                                           .. dma_pte_present(pte) ..
                                           [...]
                                           // iommu_map() completes
                                           // Device does DMA
      domain_flush_cache(pte)

The CPU 1 mapping operation shares a page table level with the CPU 0 mapping operation. CPU 0 installed a new page table level but has not flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The non coherent walker fails to see the new table level installed by CPU 0 and fails the DMA with non-present.

The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of storing a flag when CPU 0 completes the flush. If the flag is not set CPU 1 will also flush to ensure the HW can fully walk to the PTE being installed.

Cc: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27    iommu: Pass in old domain to attach_dev callback functions    (Nicolin Chen)
The IOMMU core attaches each device to a default domain on probe(). Then, every new "attach" operation has a fundamental two-fold meaning:

 - detach from its currently attached (old) domain
 - attach to a given new domain

Modern IOMMU drivers following this pattern usually want to clean up the things related to the old domain, so they call iommu_get_domain_for_dev() to fetch the old domain. Pass in the old domain pointer from the core to drivers, aligning with the set_dev_pasid op that does so already. Ensure all low-level attach functions in the core can forward the correct old domain pointer. Thus, rework those functions as well.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-26    Merge branches 'apple/dart', 'ti/omap', 'riscv', 'intel/vt-d' and 'amd/amd-vi' into next    (Joerg Roedel)
2025-09-26    iommu/vt-d: Disallow dirty tracking if incoherent page walk    (Lu Baolu)
Dirty page tracking relies on the IOMMU atomically updating the dirty bit in the paging-structure entry. For this operation to succeed, the paging-structure memory must be coherent between the IOMMU and the CPU. In other words, if the IOMMU page walk is incoherent, dirty page tracking doesn't work.

The Intel VT-d specification, Section 3.10 "Snoop Behavior" states:

  "Remapping hardware encountering the need to atomically update A/EA/D bits in a paging-structure entry that is not snooped will result in a non-recoverable fault."

To prevent an IOMMU from being incorrectly configured for dirty page tracking when it is operating in an incoherent mode, mark SSADS as supported only when both ecap_slads and ecap_smpwc are supported.

Fixes: f35f22cc760e ("iommu/vt-d: Access/Dirty bit support for SS domains")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250924083447.123224-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
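Expressed as a capability check, the condition boils down to something like the sketch below (the reporting code around it is omitted and simplified):

    /* Only advertise second-stage A/D (SSADS) dirty tracking when the page
     * walk is snooped, i.e. both SLADS and SMPWC are present in ECAP.       */
    bool ssads_supported = ecap_slads(iommu->ecap) && ecap_smpwc(iommu->ecap);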
2025-09-19    iommu/vt-d: debugfs: Avoid dumping context command register    (Lu Baolu)
The register-based cache invalidation interface is in the process of being replaced by the queued invalidation interface. The VT-d architecture allows hardware implementations with a queued invalidation interface to not implement the registers used for cache invalidation. Currently, the debugfs interface dumps the Context Command Register unconditionally, which is not reasonable. Remove it to avoid potential access to non-present registers. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20250917025051.143853-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19    iommu/vt-d: Removal of Advanced Fault Logging    (Lu Baolu)
The advanced fault logging has been removed from the specification since v4.0. Linux doesn't implement advanced fault logging functionality, but it currently dumps the advanced logging registers through debugfs. Remove the dumping of these advanced fault logging registers through debugfs to avoid potential access to non-present registers. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20250917024850.143801-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19    iommu/vt-d: PRS isn't usable if PDS isn't supported    (Lu Baolu)
The specification, Section 7.10, "Software Steps to Drain Page Requests & Responses," requires software to submit an Invalidation Wait Descriptor (inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with the Invalidation Wait Completion Status Write flag (SW=1). It then waits for the Invalidation Wait Descriptor's completion. However, the PD field in the Invalidation Wait Descriptor is optional, as stated in Section 6.5.2.9, "Invalidation Wait Descriptor": "Page-request Drain (PD): Remapping hardware implementations reporting Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this field as reserved." This implies that if the IOMMU doesn't support the PDS capability, software can't drain page requests and group responses as expected. Do not enable PCI/PRI if the IOMMU doesn't support PDS. Reported-by: Joel Granados <joel.granados@kernel.org> Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org Fixes: 66ac4db36f4c ("iommu/vt-d: Add page request draining support") Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20250915062946.120196-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
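In code terms the change amounts to a capability gate before enabling PRI, roughly like the sketch below; the exact location, return value, and the ecap_pds() accessor are assumptions for illustration:

    /* Without Page-request Drain support (PDS) we cannot drain page requests
     * and group responses as the spec's drain procedure requires, so do not
     * enable PCI/PRI on this IOMMU.                                          */
    if (!ecap_pds(iommu->ecap))
        return -ENODEV;

    /* ... otherwise proceed with PRI setup, e.g. pci_enable_pri() ... */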
2025-09-19    iommu/vt-d: Remove LPIG from page group response descriptor    (Lu Baolu)
Bit 66 in the page group response descriptor used to be the LPIG (Last Page in Group), but it was marked as Reserved since Specification 4.0. Remove programming on this bit to make it consistent with the latest specification. Existing hardware all treats bit 66 of the page group response descriptor as "ignored", therefore this change doesn't break any existing hardware. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20250901053943.1708490-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19    iommu/vt-d: Drop unused cap_super_offset()    (Yury Norov (NVIDIA))
The macro is unused. Drop the dead code. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://lore.kernel.org/r/20250913015024.81186-1-yury.norov@gmail.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19    iommu/vt-d: debugfs: Fix legacy mode page table dump logic    (Vineeth Pillai (Google))
In legacy mode, SSPTPTR is ignored if TT is not 00b or 01b. SSPTPTR may be uninitialized or zero in that case and may cause an oops like:

 Oops: general protection fault, probably for non-canonical address 0xf00087d3f000f000: 0000 [#1] SMP NOPTI
 CPU: 2 UID: 0 PID: 786 Comm: cat Not tainted 6.16.0 #191 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
 RIP: 0010:pgtable_walk_level+0x98/0x150
 RSP: 0018:ffffc90000f279c0 EFLAGS: 00010206
 RAX: 0000000040000000 RBX: ffffc90000f27ab0 RCX: 000000000000001e
 RDX: 0000000000000003 RSI: f00087d3f000f000 RDI: f00087d3f0010000
 RBP: ffffc90000f27a00 R08: ffffc90000f27a98 R09: 0000000000000002
 R10: 0000000000000000 R11: 0000000000000000 R12: f00087d3f000f000
 R13: 0000000000000000 R14: 0000000040000000 R15: ffffc90000f27a98
 FS:  0000764566dcb740(0000) GS:ffff8881f812c000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000764566d44000 CR3: 0000000109d81003 CR4: 0000000000772ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  pgtable_walk_level+0x88/0x150
  domain_translation_struct_show.isra.0+0x2d9/0x300
  dev_domain_translation_struct_show+0x20/0x40
  seq_read_iter+0x12d/0x490
  ...

Avoid walking the page table if TT is not 00b or 01b.

Fixes: 2b437e804566 ("iommu/vt-d: debugfs: Support dumping a specified page table")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250814163153.634680-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19    iommu/vt-d: Replace snprintf with scnprintf in dmar_latency_snapshot()    (Seyediman Seyedarab)
snprintf() returns the number of bytes that would have been written, not the number actually written. Using this for offset tracking can cause buffer overruns if truncation occurs. Replace snprintf() with scnprintf() to ensure the offset stays within bounds. Since scnprintf() never returns a negative value, and zero is not possible in this context because 'bytes' starts at 0 and 'size - bytes' is DEBUG_BUFFER_SIZE in the first call, which is large enough to hold the string literals used, the return value is always positive. An integer overflow is also completely out of reach here due to the small and fixed buffer size. The error check in latency_show_one() is therefore unnecessary. Remove it and make dmar_latency_snapshot() return void. Signed-off-by: Seyediman Seyedarab <ImanDevel@gmail.com> Link: https://lore.kernel.org/r/20250731225048.131364-1-ImanDevel@gmail.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
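The difference, in one illustrative offset-tracking loop of the kind the commit is about (buffer and variable names are made up):

    /* snprintf() returns the length that *would* have been written, so on
     * truncation 'bytes' can run past the buffer and later writes overrun.  */
    bytes += snprintf(str + bytes, size - bytes, "%s", label);

    /* scnprintf() returns the number of characters actually written (minus
     * the trailing NUL), so 'bytes' can never step beyond 'size'.           */
    bytes += scnprintf(str + bytes, size - bytes, "%s", label);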
2025-09-05    iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page()    (Eugene Koira)
switch_to_super_page() assumes the memory range it's working on is aligned to the target large page level. Unfortunately, __domain_mapping() doesn't take this into account when using it, and will pass unaligned ranges, ultimately freeing a PTE range larger than expected.

Take for example a mapping with the following iov_pfn range [0x3fe400, 0x4c0600), which should be backed by the following mappings:

   iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
   iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
   iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages

Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff] to switch_to_super_page() at a 1 GiB granularity, which will in turn free PTEs all the way to iov_pfn 0x4fffff.

Mitigate this by rounding down the iov_pfn range passed to switch_to_super_page() in __domain_mapping() to the target large page level. Additionally add range alignment checks to switch_to_super_page().

Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira@amazon.com>
Cc: stable@vger.kernel.org
Reviewed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-08-01    Merge tag 'pci-v6.17-changes' of ↵    (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Allow built-in drivers, not just modular drivers, to use async initial probing (Lukas Wunner) - Support Immediate Readiness even on devices with no PM Capability (Sean Christopherson) - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the required delay between a reset and sending config requests to a device (Niklas Cassel) - Add pci_is_display() to check for "Display" base class and use it in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello) - Allow 'isolated PCI functions' (multi-function devices without a function 0) for LoongArch, similar to s390 and jailhouse (Huacai Chen) Power control: - Add ability to enable optional slot clock for cases where the PCIe host controller and the slot are supplied by different clocks (Marek Vasut) PCIe native device hotplug: - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by misinterpreting a config read failure after a device has been removed (Lukas Wunner) - Avoid creating a useless PCIe port service device for pciehp if the slot is handled by the ACPI hotplug driver (Lukas Wunner) - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug ports (Lukas Wunner) Virtualization: - Save VF resizable BAR state and restore it after reset (Michał Winiarski) - Allow IOV resources (VF BARs) to be resized (Michał Winiarski) - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size (Michał Winiarski) Endpoint framework: - Add RC-to-EP doorbell support using platform MSI controller, including a test case (Frank Li) - Allow BAR assignment via configfs so platforms have flexibility in determining BAR usage (Jerome Brunet) Native PCIe controller drivers: - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie, axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to DT schema format (Rob Herring) - Use dev_fwnode() instead of of_fwnode_handle() to remove OF dependency in altera (fixes an unused variable), designware-host, mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma, xilinx-nwl (Jiri Slaby, Arnd Bergmann) - Convert aardvark, altera, brcmstb, designware-host, iproc, mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx, xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to using msi_create_parent_irq_domain() instead; this makes the interrupt controller per-PCI device, allows dynamic allocation of vectors after initialization, and allows support of IMS (Nam Cao) APM X-Gene PCIe controller driver: - Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug bits, use device-managed memory allocations, and clean things up (Marc Zyngier) - Probe xgene-msi as a standard platform driver rather than a subsys_initcall (Marc Zyngier) Broadcom STB PCIe controller driver: - Add optional DT 'num-lanes' property and if present, use it to override the Maximum Link Width advertised in Link Capabilities (Jim Quinlan) Cadence PCIe controller driver: - Use PCIe Message routing types from the PCI core rather than defining private ones (Hans Zhang) Freescale i.MX6 PCIe controller driver: - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu) - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features (Richard Zhu) - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can trigger doorbel on Endpoint (Frank Li) - Remove apps_reset (LTSSM_EN) from imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug regression on i.MX8MM (Richard Zhu) - Delay 
Endpoint link start until configfs 'start' written (Richard Zhu) Intel VMD host bridge driver: - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo) Qualcomm PCIe controller driver: - Add DT binding and driver support for SA8255p, which supports ECAM for Configuration Space access (Mayank Rana) - Update DT binding and driver to describe PHYs and per-Root Port resets in a Root Port stanza and deprecate describing them in the host bridge; this makes it possible to support multiple Root Ports in the future (Krishna Chaitanya Chundru) - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang) - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang) - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT bindings (Konrad Dybcio) - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue Zhang) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Rockchip PCIe controller driver: - Drop unused PCIe Message routing and code definitions (Hans Zhang) - Remove several unused header includes (Hans Zhang) - Use standard PCIe config register definitions instead of rockchip-specific redefinitions (Geraldo Nascimento) - Set Target Link Speed to 5.0 GT/s before retraining so we have a chance to train at a higher speed (Geraldo Nascimento) Rockchip DesignWare PCIe controller driver: - Prevent race between link training and register update via DBI by inhibiting link training after hot reset and link down (Wilfred Mallawa) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Sophgo PCIe controller driver: - Add DT binding and driver for Sophgo SG2044 PCIe controller driver in Root Complex mode (Inochi Amaoto) Synopsys DesignWare PCIe controller driver: - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on Ports that support > 5.0 GT/s. Slower Ports still rely on the not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while waiting for the Link (Niklas Cassel)" * tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits) dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset dt-bindings: PCI: Remove 83xx-512x-pci.txt dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema dt-bindings: PCI: Convert apm,xgene-pcie to DT schema dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema dt-bindings: PCI: Convert st,spear1340-pcie to DT schema PCI: Move is_pciehp check out of pciehp_is_native() PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports selftests: pci_endpoint: Add doorbell test case misc: pci_endpoint_test: Add doorbell test case PCI: endpoint: pci-epf-test: Add doorbell test support PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode PCI: vmd: Switch to msi_create_parent_irq_domain() PCI: vmd: Convert to lock guards ...
2025-07-31    Merge tag 'for-linus-iommufd' of ↵    (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "This broadly brings the assigned HW command queue support to iommufd. This feature is used to improve SVA performance in VMs by avoiding paravirtualization traps during SVA invalidations. Along the way I think some of the core logic is in a much better state to support future driver backed features. Summary: - IOMMU HW now has features to directly assign HW command queues to a guest VM. In this mode the command queue operates on a limited set of invalidation commands that are suitable for improving guest invalidation performance and easy for the HW to virtualize. This brings the generic infrastructure to allow IOMMU drivers to expose such command queues through the iommufd uAPI, mmap the doorbell pages, and get the guest physical range for the command queue ring itself. - An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built on the new iommufd command queue features. It works with the existing SMMU driver support for cmdqv in guest VMs. - Many precursor cleanups and improvements to support the above cleanly, changes to the general ioctl and object helpers, driver support for VDEVICE, and mmap pgoff cookie infrastructure. - Sequence VDEVICE destruction to always happen before VFIO device destruction. When using the above type features, and also in future confidential compute, the internal virtual device representation becomes linked to HW or CC TSM configuration and objects. If a VFIO device is removed from iommufd those HW objects should also be cleaned up to prevent a sort of UAF. This became important now that we have HW backing the VDEVICE. - Fix one syzkaller found error related to math overflows during iova allocation" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits) iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3 iommufd: Rename some shortterm-related identifiers iommufd/selftest: Add coverage for vdevice tombstone iommufd/selftest: Explicitly skip tests for inapplicable variant iommufd/vdevice: Remove struct device reference from struct vdevice iommufd: Destroy vdevice on idevice destroy iommufd: Add a pre_destroy() op for objects iommufd: Add iommufd_object_tombstone_user() helper iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice iommufd/selftest: Test reserved regions near ULONG_MAX iommufd: Prevent ALIGN() overflow iommu/tegra241-cmdqv: import IOMMUFD module namespace iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Use request_threaded_irq iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops ...
2025-07-30    Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm    (Linus Torvalds)
Pull kvm updates from Paolo Bonzini:
 "ARM:
 - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts
 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface
 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally
 - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range
 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor
 - Convert more system register sanitization to the config-driven implementation
 - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls
 - Various cleanups and minor fixes

 LoongArch:
 - Add stat information for in-kernel irqchip
 - Add tracepoints for CPUCFG and CSR emulation exits
 - Enhance in-kernel irqchip emulation
 - Various cleanups

 RISC-V:
 - Enable ring-based dirty memory tracking
 - Improve perf kvm stat to report interrupt events
 - Delegate illegal instruction trap to VS-mode
 - MMU improvements related to upcoming nested virtualization

 s390x:
 - Fixes

 x86:
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time
 - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n)
 - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives
 - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical
 - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps
 - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently
 - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed
 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo
 - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone
 - Dynamically allocate the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list)
 - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC
 - Various cleanups and fixes

 x86 (Intel):
 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests
 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF

 x86 (AMD):
 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still)
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation
 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs
 - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU
 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model
 - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care
 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance
 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data

 Generic:
 - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs
 - Clean up and document/comment the irqfd assignment code
 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique
 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions
 - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL
 - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely
 - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation

 Selftests:
 - Fix a comment typo
 - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing)
 - Skip tests that hit EACCES when attempting to access a file, and print a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30Merge tag 'iommu-updates-v6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu updates from Will Deacon:
 "Core:
 - Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
 - Convert the x86 drivers over to msi_create_parent_irq_domain()

 AMD-Vi:
 - Add support for examining driver/device internals via debugfs
 - Add support for "HATDis" to disable host translation when it is not supported
 - Add support for limiting the maximum host translation level based on EFR[HATS]

 Apple DART:
 - Don't enable as built-in by default when ARCH_APPLE is selected

 Arm SMMU:
 - Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
 - Support for Qualcomm SM6115 MDSS parts
 - Disable PRR on Qualcomm SM8250 as using these bits causes the hypervisor to explode

 Intel VT-d:
 - Reorganize Intel VT-d to be ready for iommupt
 - Optimize iotlb_sync_map for non-caching/non-RWBF modes
 - Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()

 Mediatek:
 - Fix build warnings when W=1

 Samsung Exynos:
 - Add support for reserved memory regions specified by the bootloader

 TI OMAP:
 - Use syscon_regmap_lookup_by_phandle_args() instead of parsing the node manually

 Misc:
 - Cleanups and minor fixes across the board"

* tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (48 commits)
  iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
  iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
  dt-bindings: arm-smmu: Remove sdm845-cheza specific entry
  iommu/amd: Fix geometry.aperture_end for V2 tables
  iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
  iommu/amd: Add documentation for AMD IOMMU debugfs support
  iommu/amd: Add debugfs support to dump IRT Table
  iommu/amd: Add debugfs support to dump device table
  iommu/amd: Add support for device id user input
  iommu/amd: Add debugfs support to dump IOMMU command buffer
  iommu/amd: Add debugfs support to dump IOMMU Capability registers
  iommu/amd: Add debugfs support to dump IOMMU MMIO registers
  iommu/amd: Refactor AMD IOMMU debugfs initial setup
  dt-bindings: arm-smmu: document the support on Milos
  iommu/exynos: add support for reserved regions
  iommu/arm-smmu: disable PRR on SM8250
  iommu/arm-smmu-v3: Revert vmaster in the error path
  iommu/io-pgtable-arm: Remove unused macro iopte_prot
  iommu/arm-smmu-qcom: Add SM6115 MDSS compatible
  iommu/qcom: Fix pgsize_bitmap
  ...
2025-07-29Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list (sketched below). Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI.
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand.
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time.
 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation.
 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry.
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs.
 - Dedup x86's device posted IRQ code, as the vast majority of functionality can be shared verbatim between SVM and VMX.
 - Harden the device posted IRQ code against bugs and runtime errors.
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n).
 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU.
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs.
 - Clean up and document/comment the irqfd assignment code.
 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.
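As an illustration of the xarray-based matching mentioned in the first bullet, here is a minimal kernel-style sketch. It is not the actual virt/lib/irqbypass.c code: the pair struct, registry name and sketch_register_producer() are assumptions; only the xarray API calls and the eventfd_ctx token idea come from the changelog.

  #include <linux/xarray.h>
  #include <linux/slab.h>
  #include <linux/eventfd.h>

  struct sketch_bypass_pair {
          void *producer;         /* stand-in for struct irq_bypass_producer */
          void *consumer;         /* stand-in for struct irq_bypass_consumer */
  };

  /* one global registry, indexed by the eventfd_ctx token */
  static DEFINE_XARRAY(sketch_bypass_pairs);

  static int sketch_register_producer(struct eventfd_ctx *token, void *producer)
  {
          unsigned long index = (unsigned long)token;
          struct sketch_bypass_pair *pair;
          void *old;

          /* O(1) lookup of the peer, instead of walking a global list */
          pair = xa_load(&sketch_bypass_pairs, index);
          if (!pair) {
                  pair = kzalloc(sizeof(*pair), GFP_KERNEL);
                  if (!pair)
                          return -ENOMEM;
                  old = xa_store(&sketch_bypass_pairs, index, pair, GFP_KERNEL);
                  if (xa_is_err(old)) {
                          kfree(pair);
                          return xa_err(old);
                  }
          }
          if (pair->producer)
                  return -EBUSY;
          pair->producer = producer;
          /* if pair->consumer is already registered, wire up posting here */
          return 0;
  }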
2025-07-24Merge branch 'intel/vt-d' into nextWill Deacon
* intel/vt-d:
  iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
  iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
  iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
  iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
  iommu/vt-d: Split paging_domain_compatible()
  iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
  iommu/vt-d: Create unique domain ops for each stage
  iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
  iommu/vt-d: Do not wipe out the page table NID when devices detach
  iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
  iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
  iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
2025-07-23iommu/vt-d: Fix UAF on sva unbind with pending IOPFsLu Baolu
Commit 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path") disables IOPF on a device by removing the device from its IOMMU's IOPF queue when the last IOPF-capable domain is detached from the device. Unfortunately, it did this in the wrong place, where there are still pending IOPFs. As a result, a use-after-free error is potentially triggered and eventually a kernel panic with a kernel trace similar to the following:

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 3 PID: 313 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
 Workqueue: iopf_queue/dmar0-iopfq iommu_sva_handle_iopf
 Call Trace:
  <TASK>
  iopf_free_group+0xe/0x20
  process_one_work+0x197/0x3d0
  worker_thread+0x23a/0x350
  ? rescuer_thread+0x4a0/0x4a0
  kthread+0xf8/0x230
  ? finish_task_switch.isra.0+0x81/0x260
  ? kthreads_online_cpu+0x110/0x110
  ? kthreads_online_cpu+0x110/0x110
  ret_from_fork+0x13b/0x170
  ? kthreads_online_cpu+0x110/0x110
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---

The intel_pasid_tear_down_entry() function is responsible for blocking hardware from generating new page faults and flushing all in-flight ones. Therefore, moving iopf_for_domain_remove() after this function should resolve this.

Fixes: 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path")
Reported-by: Ethan Milon <ethan.milon@eviden.com>
Closes: https://lore.kernel.org/r/e8b37f3e-8539-40d4-8993-43a1f3ffe5aa@eviden.com
Suggested-by: Ethan Milon <ethan.milon@eviden.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250723072045.1853328-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
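The corrected ordering can be sketched as follows; only intel_pasid_tear_down_entry() and iopf_for_domain_remove() are taken from the commit, while the surrounding function and exact argument lists are simplified assumptions.

  /* detach path, simplified sketch */
  static void sketch_detach_dev_pasid(struct intel_iommu *iommu, struct device *dev,
                                      struct dmar_domain *domain, u32 pasid)
  {
          /*
           * Tear down the PASID entry first: this blocks the hardware from
           * generating new page faults and flushes all in-flight ones.
           */
          intel_pasid_tear_down_entry(iommu, dev, pasid, false);

          /*
           * Only now is it safe to drop the device from the IOPF queue.
           * Doing this earlier left pending faults referencing freed state,
           * which is the refcount underflow seen in the trace above.
           */
          iopf_for_domain_remove(&domain->domain, dev);
  }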
2025-07-21iommu/vt-d: Make iotlb_sync_map a static property of dmar_domainLu Baolu
Commit 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes") dynamically set iotlb_sync_map. This causes synchronization issues due to lack of locking on the map and attach paths, racing with iommufd userspace operations.

Invalidation changes must precede device attachment to ensure all flushes complete before hardware walks the page tables, preventing coherence issues.

Make domain->iotlb_sync_map static, set once during domain allocation. If an IOMMU requires iotlb_sync_map but the domain lacks it, attach is rejected. This won't reduce domain sharing: RWBF and shadowing page table caching are legacy uses with legacy hardware. Mixed configs (some IOMMUs in caching mode, others not) are unlikely in real-world scenarios.

Fixes: 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250721051657.1695788-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
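In outline, the static-property scheme looks like the sketch below. cap_caching_mode() and cap_rwbf() are the real capability macros; the helper name and the exact attach-time check are assumptions based on the changelog.

  /* decided once at domain allocation time, never flipped afterwards */
  static bool sketch_need_iotlb_sync_map(struct intel_iommu *iommu)
  {
          /* only caching-mode or RWBF hardware needs a flush on map */
          return cap_caching_mode(iommu->cap) || cap_rwbf(iommu->cap);
  }

  /* at domain allocation */
  domain->iotlb_sync_map = sketch_need_iotlb_sync_map(iommu);

  /* at attach: a domain without the property cannot serve an IOMMU that needs it */
  if (sketch_need_iotlb_sync_map(iommu) && !domain->iotlb_sync_map)
          return -EINVAL;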
2025-07-17iommu/vt-d: Use pci_is_display()Mario Limonciello
The inline pci_is_display() helper does the same thing. Use it.

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Daniel Dadap <ddadap@nvidia.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patch.msgid.link/20250717173812.3633478-5-superm1@kernel.org
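For reference, the substitution is of this form; the "before" check is an illustrative open-coded version, not necessarily the exact line that was replaced in the driver.

  #include <linux/pci.h>

  /* before: open-coded display class check */
  static bool sketch_is_gfx_old(struct pci_dev *pdev)
  {
          return (pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY;
  }

  /* after: the shared helper from <linux/pci.h> */
  static bool sketch_is_gfx_new(struct pci_dev *pdev)
  {
          return pci_is_display(pdev);
  }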
2025-07-14iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_rangeEthan Milon
The logic in cache_tag_flush_all() to iterate over cache tags and issue TLB invalidations is largely duplicated in cache_tag_flush_range(), with the only difference being the range parameters.

Extend cache_tag_flush_range() to handle a full address space flush when called with start = 0 and end = ULONG_MAX. This allows cache_tag_flush_all() to simply delegate to cache_tag_flush_range().

Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-2-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-12-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
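The resulting delegation is tiny; the signature below approximates the helpers named in the commit rather than reproducing the driver code verbatim.

  /* a full address space flush is just a ranged flush over everything */
  void cache_tag_flush_all(struct dmar_domain *domain)
  {
          cache_tag_flush_range(domain, 0, ULONG_MAX, 0);
  }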
2025-07-14iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_allEthan Milon
The function cache_tag_flush_all() was originally implemented with incorrect device TLB invalidation logic that does not handle PASID, in commit c4d27ffaa8eb ("iommu/vt-d: Add cache tag invalidation helpers").

This causes regressions where full address space TLB invalidations occur with a PASID attached, such as during transparent hugepage unmapping in SVA configurations or when calling iommu_flush_iotlb_all(). In these cases, the device receives a TLB invalidation that lacks PASID.

This incorrect logic was later extracted into cache_tag_flush_devtlb_all(), in commit 3297d047cd7f ("iommu/vt-d: Refactor IOTLB and Dev-IOTLB flush for batching").

The fix replaces the call to cache_tag_flush_devtlb_all() with cache_tag_flush_devtlb_psi(), which properly handles PASID.

Fixes: 4f609dbff51b ("iommu/vt-d: Use cache helpers in arch_invalidate_secondary_tlbs")
Fixes: 4e589a53685c ("iommu/vt-d: Use cache_tag_flush_all() in flush_iotlb_all")
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-1-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-11-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
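Conceptually the fix is a one-call substitution. The argument list below is an assumption based on the changelog; the point is that the _psi variant carries the cache tag's PASID into the device TLB invalidation, while the _all variant did not.

  /* sketch: flushing the whole space for one device TLB cache tag */
  static void sketch_flush_devtlb_tag_all(struct dmar_domain *domain,
                                          struct cache_tag *tag)
  {
          /*
           * was: cache_tag_flush_devtlb_all(domain, tag); which issued the
           * invalidation without tag->pasid, so SVA endpoints received a
           * flush for the wrong (no-PASID) context.
           */
          cache_tag_flush_devtlb_psi(domain, tag, 0, ULONG_MAX);
  }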
2025-07-14iommu/vt-d: Split paging_domain_compatible()Jason Gunthorpe
Make First/Second stage specific functions that follow the same pattern in intel_iommu_domain_alloc_first/second_stage() for computing EOPNOTSUPP. This makes the code easier to understand: if we couldn't create a domain with the parameters for this IOMMU instance, then we certainly are not compatible with it.

Check superpage support directly against the per-stage cap bits and the pgsize_bitmap.

Add a note that force_snooping is read without locking. The locking needs to cover the compatible check and the add of the device to the list.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-10-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
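The per-stage pattern looks roughly like this for the first stage; the cap/ecap macros are real VT-d capability helpers, but the function name and the exact set of checks are simplified assumptions.

  static int sketch_first_stage_compatible(struct intel_iommu *iommu,
                                           struct dmar_domain *domain)
  {
          /* could we even have allocated a first-stage domain on this IOMMU? */
          if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))
                  return -EOPNOTSUPP;

          /* superpage support is checked against the stage's own cap bit */
          if ((domain->domain.pgsize_bitmap & SZ_1G) &&
              !cap_fl1gp_support(iommu->cap))
                  return -EOPNOTSUPP;

          return 0;
  }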
2025-07-14iommu/vt-d: Split intel_iommu_enforce_cache_coherency()Jason Gunthorpe
First Stage and Second Stage have very different ways to deny no-snoop. The first stage uses the PGSNP bit, which is global per-PASID, so enabling it requires loading new PASID entries for all the attached devices. The second stage uses a bit per PTE, so enabling it just requires telling future maps to set the bit.

Since we now have two domain ops we can have two functions that directly code their required actions instead of a bunch of logic dancing around use_first_level.

Combine domain_set_force_snooping() into the new functions since they are its only callers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-9-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
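A sketch of the two resulting paths follows. The bodies are illustrative; only the PGSNP-per-PASID versus per-PTE-bit distinction comes from the changelog, and the helper for rewriting PASID entries is an assumed name.

  /*
   * first stage: PGSNP lives in the PASID entry, so every attached
   * device's PASID table entry must be reloaded with the bit set
   */
  static bool sketch_fs_enforce_cache_coherency(struct iommu_domain *domain)
  {
          struct dmar_domain *dmar_domain = to_dmar_domain(domain);

          dmar_domain->force_snooping = true;
          return sketch_reload_pasid_entries_with_pgsnp(dmar_domain); /* assumed helper */
  }

  /* second stage: snoop control is a per-PTE bit, only future maps need it */
  static bool sketch_ss_enforce_cache_coherency(struct iommu_domain *domain)
  {
          to_dmar_domain(domain)->force_snooping = true;
          return true;
  }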
2025-07-14iommu/vt-d: Create unique domain ops for each stageJason Gunthorpe
Use the domain ops pointer to tell what kind of domain it is instead of the internal use_first_level indication. This also protects against wrongly using a SVA/nested/IDENTITY/BLOCKED domain type in places it should not be used.

The only remaining uses of use_first_level outside the paging domain are in paging_domain_compatible() and intel_iommu_enforce_cache_coherency(). Thus, remove the useless sets of use_first_level in intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of the unique ops for these domain types ever reference it on their call chains.

Add a WARN_ON() check in domain_context_mapping_one() as it only works with the second stage.

This is preparation for iommupt, which will have different ops for each of the stages.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-8-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
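With per-stage ops, code can key off the ops pointer instead of use_first_level. The ops symbol names below are assumptions for illustration, not necessarily those used in the patch.

  /* which stage is this paging domain? */
  static bool sketch_is_first_stage(struct iommu_domain *domain)
  {
          return domain->ops == &intel_fs_paging_domain_ops;      /* assumed symbol */
  }

  /* context mapping only makes sense for second stage, hence the new WARN_ON() */
  static int sketch_context_map_check(struct iommu_domain *domain)
  {
          if (WARN_ON(domain->ops != &intel_ss_paging_domain_ops)) /* assumed symbol */
                  return -EINVAL;
          return 0;
  }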