summaryrefslogtreecommitdiff
path: root/drivers/iommu/generic_pt
AgeCommit message (Collapse)Author
6 daysMerge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', ↵Joerg Roedel
'rockchip', 'verisilicon', 'riscv', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
2026-05-19iommu_pt: add kunit config for 32-bit VA (amdv1_cfg_1)Ankit Soni
Add test coverage for small VAs (32‑bit) starting at level 2 by enabling the AMDv1 KUnit configuration. This limits level expansion because the starting level can accommodate only the maximum virtual address requested. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Soni <Ankit.Soni@amd.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-19iommu_pt: support small VA for AMDv1Ankit Soni
When hardware/VM request a small VA limit, the generic page-table code clears PT_FEAT_DYNAMIC_TOP. This later causes domain initialization to fail with -EOPNOTSUPP. Remove the clearing so init succeeds when the VA fits in the starting level and no top-level growth is needed. Signed-off-by: Ankit Soni <Ankit.Soni@amd.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-19iommu_pt: Fix pgsize_bitmap calculation in get_info for smaller vasz'sAnkit Soni
To properly enforce the domain VA limit, clamp pgsize_bitmap using the requested max_vasz_lg2 in get_info(). Apply the same VA limit as get_info() in the kunit possible_sizes test so assertions stay consistent with the domain bitmap. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Soni <Ankit.Soni@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-19iommu/riscv: Enable PT_FEAT_DETAILED_GATHER and pass gather to iotlb_invalJason Gunthorpe
RISC-V can use the information from PT_FEAT_DETAILED_GATHER to compute the best stride to generate the single TLB invalidations. Pass the gather down to the lower functions and create a full-range gather for the flush-all callback. Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Andrew Jones <andrew.jones@oss.qualcomm.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-19iommupt: Add PT_FEAT_DETAILED_GATHERJason Gunthorpe
Generating the ARM SMMUv3 and RISC-V invalidation commands optimally requires some additional details from iommupt: - leaf_levels_bitmap is used to compute the ARM Range Invalidation Table Top Level hint - leaf_levels_bitmap is also used to compute the stride when generating single invalidations to invalidate once per leaf - table_levels_bitmap also computes the ARM TTL for future cases when there are no leaves Put these under a feature since only two drivers need to calculate them. This is also useful for the coming kunit iotlb invalidation test to know more about what invalidation is happening. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Tested-by: Andrew Jones <andrew.jones@oss.qualcomm.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-19iommupt: Add struct iommupt_pending_gatherJason Gunthorpe
Add a struct to keep track of all the things that are pending to be merged into the gather. The way gather merging works, the pending range is checked against the current gather, and the current gather can be flushed before the pending things are added. Thus, if new things have to be recorded in the gather they need to be kept in the pending struct until after the gather is optionally flushed. The next patch adds new items to the gather and the pending struct. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Tested-by: Andrew Jones <andrew.jones@oss.qualcomm.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-15iommupt: Fixup build warning by using BIT_ULL() for RISCVPT_NC/IOFangyu Yu
Fix build warning on 32-bit configurations by using BIT_ULL() for RISCVPT_NC and RISCVPT_IO. Fixes: 6c21eb174c6c ("iommupt: Encode IOMMU_MMIO/IOMMU_CACHE via RISC-V Svpbmt bits") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605121350.wZxB51k0-lkp@intel.com/ Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-15iommupt: Fix the end_index calculation in __map_range_leaf()Jason Gunthorpe
Sashiko noticed a mismatch of units in this math: num_leaves is actually the number of leaf *entries* (so a 16-item contiguous leaf is one num_leaves), while index is in items. The mismatch in maths causes __map_range_leaf() to exit early instead of efficiently filling a larger range of contiguous PTEs. The early exit is caught by the functions above and then __map_range_leaf() is re-invoked, so there is no functional issue. Correct the misuse of units by adjusting num_leaves with the leaf size and avoid the performance cost of looping externally. There are also some mismatched types for num_leaves; simplify things to remove the duplicated calculations. Fixes: d6c65b0fd621 ("iommupt: Avoid rewalking during map") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewd-by: Pranjal Shrivastava <praan@google.com> Tested-by: Josua Mayer <josua@solid-run.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-15iommupt: Check for missing PAGE_SIZE in the pgsize_bitmapJason Gunthorpe
Sashiko pointed out that the driver could drop PAGE_SIZE from the pgsize_bitmap. That is technically allowed but nothing does it, and such an iommu_domain would not be used with the DMA API today. Still, it is against the design and it is trivial to fix up. Lift the PT_WARN_ON to the if branch and just skip the fast path. Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Josua Mayer <josua@solid-run.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-05-11iommupt: Encode IOMMU_MMIO/IOMMU_CACHE via RISC-V Svpbmt bitsFangyu Yu
When the RISC-V IOMMU page table format support Svpbmt, PBMT provides a way to tag mappings with page-based memory types. Encode memory type via PBMT in RISC-V IOMMU PTEs: - IOMMU_MMIO -> PBMT=IO - !IOMMU_MMIO && !IOMMU_CACHE -> PBMT=NC - otherwise -> PBMT=Normal (PBMT=0) Only touch PBMT when PT_FEAT_RISCV_SVPBMT is advertised. Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Guo Ren <guoren@kernel.org> Reviewed-by: Nutty Liu <nutty.liu@hotmail.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-04-09Merge branches 'fixes', 'arm/smmu/updates', 'arm/smmu/bindings', 'riscv', ↵Will Deacon
'intel/vt-d', 'amd/amd-vi' and 'core' into next
2026-03-27iommupt/amdv1: mark amdv1pt_install_leaf_entry as __always_inlineSherry Yang
After enabling CONFIG_GCOV_KERNEL and CONFIG_GCOV_PROFILE_ALL, following build failure is observed under GCC 14.2.1: In function 'amdv1pt_install_leaf_entry', inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:650:3, inlined from '__map_single_page0' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1, inlined from 'pt_descend' at drivers/iommu/generic_pt/fmt/../pt_iter.h:391:9, inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:657:10, inlined from '__map_single_page1.constprop' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1: ././include/linux/compiler_types.h:706:45: error: call to '__compiletime_assert_71' declared with attribute error: FIELD_PREP: value too large for the field 706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ...... drivers/iommu/generic_pt/fmt/amdv1.h:220:26: note: in expansion of macro 'FIELD_PREP' 220 | FIELD_PREP(AMDV1PT_FMT_OA, | ^~~~~~~~~~ In the path '__do_map_single_page()', level 0 always invokes 'pt_install_leaf_entry(&pts, map->oa, PAGE_SHIFT, …)'. At runtime that lands in the 'if (oasz_lg2 == isz_lg2)' arm of 'amdv1pt_install_leaf_entry()'; the contiguous-only 'else' block is unreachable for 4 KiB pages. With CONFIG_GCOV_KERNEL + CONFIG_GCOV_PROFILE_ALL, the extra instrumentation changes GCC's inlining so that the "dead" 'else' branch still gets instantiated. The compiler constant-folds the contiguous OA expression, runs the 'FIELD_PREP()' compile-time check, and produces: FIELD_PREP: value too large for the field gcov-enabled builds therefore fail even though the code path never executes. Fix this by marking amdv1pt_install_leaf_entry as __always_inline. Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op") Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-27iommupt: Fix short gather if the unmap goes into a large mappingJason Gunthorpe
unmap has the odd behavior that it can unmap more than requested if the ending point lands within the middle of a large or contiguous IOPTE. In this case the gather should flush everything unmapped which can be larger than what was requested to be unmapped. The gather was only flushing the range requested to be unmapped, not extending to the extra range, resulting in a short invalidation if the caller hits this special condition. This was found by the new invalidation/gather test I am adding in preparation for ARMv8. Claude deduced the root cause. As far as I remember nothing relies on unmapping a large entry, so this is likely not a triggerable bug. Cc: stable@vger.kernel.org Fixes: 7c53f4238aa8 ("iommupt: Add unmap_pages op") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-17iommupt: Avoid rewalking during mapJason Gunthorpe
Currently the core code provides a simplified interface to drivers where it fragments a requested multi-page map into single page size steps after doing all the calculations to figure out what page size is appropriate. Each step rewalks the page tables from the start. Since iommupt has a single implementation of the mapping algorithm it can internally compute each step as it goes while retaining its current position in the walk. Add a new function pt_pgsz_count() which computes the same page size fragement of a large mapping operations. Compute the next fragment when all the leaf entries of the current fragement have been written, then continue walking from the current point. The function pointer is run through pt_iommu_ops instead of iommu_domain_ops to discourage using it outside iommupt. All drivers with their own page tables should continue to use the simplified map_pages() style interfaces. Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-17iommupt: Directly call iommupt's unmap_range()Jason Gunthorpe
The common algorithm in iommupt does not require the iommu_pgsize() calculations, it can directly unmap any arbitrary range. Add a new function pointer to directly call an iommupt unmap_range op and make __iommu_unmap() call it directly. Gives about a 5% gain on single page unmappings. The function pointer is run through pt_iommu_ops instead of iommu_domain_ops to discourage using it outside iommupt. All drivers with their own page tables should continue to use the simplified map/unmap_pages() style interfaces. Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-17iommupt: Optimize the gather processing for DMA-FQ modeJason Gunthorpe
In PT_FEAT_FLUSH_RANGE mode the gather was accumulated but never flushed and then the accumulated range was discarded by the dma-iommu code in DMA-FQ mode. This is basically optimal. However for PT_FEAT_FLUSH_RANGE_NO_GAPS the page table would push flushes that are redundant with the flush all generated by the DMA-FQ mode. Disable all range accumulation in the gather, and iommu_pt triggered flushing when in iommu_iotlb_gather_queued() indicates it is in DMA-FQ mode. Reported-by: Robin Murphy <robin.murphy@arm.com> Closes: https://lore.kernel.org/r/794b6121-b66b-4819-b291-9761ed21cd83@arm.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-03-17iommupt: Add the RISC-V page table formatJason Gunthorpe
The RISC-V format is a fairly simple 5 level page table not unlike the x86 one. It has optional support for a single contiguous page size of 64k (16 x 4k). The specification describes a 32-bit format, the general code can support it via a #define but the iommu side implementation has been left off until a user comes. Tested-by: Vincent Chen <vincent.chen@sifive.com> Acked-by: Paul Walmsley <pjw@kernel.org> # arch/riscv Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> Tested-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-02-06Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and ↵Joerg Roedel
'core' into next
2026-02-03iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()Yu Zhang
Add current (iova, len) to the iotlb gather, regardless of the setting of PT_FEAT_FLUSH_RANGE or PT_FEAT_FLUSH_RANGE_NO_GAPS. In gather_range_pages(), the current IOVA range is only added to iotlb_gather when PT_FEAT_FLUSH_RANGE is set. Yet a virtual IOMMU with NpCache uses only PT_FEAT_FLUSH_RANGE_NO_GAPS. In that case, iotlb_gather will stay empty (start=ULONG_MAX, end=0) after initialization, and the current (iova, len) will not be added to the iotlb_gather, causing subsequent iommu_iotlb_sync() to perform IOTLB invalidation with wrong parameters (e.g., amd_iommu_iotlb_sync() computes size from gather->end - gather->start + 1, leading to an invalid range). The disjoint check and sync for PT_FEAT_FLUSH_RANGE_NO_GAPS remain unchanged: when the new range is disjoint from the existing gather, we still sync first and then add the new range, so semantics for NO_GAPS are preserved. Fixes: 7c53f4238aa8 ("iommupt: Add unmap_pages op") Cc: stable@vger.kernel.org Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-28iommupt: Only cache flush memory changed by unmapJason Gunthorpe
The cache flush was happening on every level across the whole range of iteration, even if no leafs or tables were cleared. Instead flush only the sub range that was actually written. Overflushing isn't a correctness problem but it does impact the performance of unmap. After this series the performance compared to the original VT-d implementation with cache flushing turned on is: map_pages pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 253,266 , 213,227 , 6.06 2^21, 246,244 , 221,219 , 0.00 2^30, 231,240 , 209,217 , 3.03 256*2^12, 2604,2668 , 2415,2540 , 4.04 256*2^21, 2495,2824 , 2390,2734 , 12.12 256*2^30, 2542,2845 , 2380,2718 , 12.12 unmap_pages pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 259,292 , 222,251 , 11.11 2^21, 255,259 , 227,236 , 3.03 2^30, 238,254 , 217,230 , 5.05 256*2^12, 2751,2620 , 2417,2437 , 0.00 256*2^21, 2461,2526 , 2377,2423 , 1.01 256*2^30, 2498,2543 , 2370,2404 , 1.01 Fixes: efa03dab7ce4 ("iommupt: Flush the CPU cache after any writes to the page table") Reported-by: Francois Dugast <francois.dugast@intel.com> Closes: https://lore.kernel.org/all/20260121130233.257428-1-francois.dugast@intel.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-20iommupt: Make it clearer to the compiler that pts.level == 0 for single pageJason Gunthorpe
Older versions of gcc and clang sometimes get tripped up by the build time assertion in FIELD_PREP because they can see that the argument to FIELD_PREP is constant but can't see that the if condition protecting it is also a constant false. In file included from <command-line>: In function 'amdv1pt_install_leaf_entry', inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:651:3, inlined from '__map_single_page0' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:662:1, inlined from 'pt_descend' at drivers/iommu/generic_pt/fmt/../pt_iter.h:391:9, inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:658:10, inlined from '__map_single_page1.constprop' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:662:1: ././include/linux/compiler_types.h:631:45: error: call to '__compiletime_assert_251' declared with attribute error: FIELD_PREP: value too large for the field 631 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^ ././include/linux/compiler_types.h:612:25: note: in definition of macro '__compiletime_assert' 612 | prefix ## suffix(); \ | ^~~~~~ ././include/linux/compiler_types.h:631:9: note: in expansion of macro '_compiletime_assert' 631 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ ./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ ./include/linux/bitfield.h:69:17: note: in expansion of macro 'BUILD_BUG_ON_MSG' 69 | BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ? \ | ^~~~~~~~~~~~~~~~ ./include/linux/bitfield.h:90:17: note: in expansion of macro '__BF_FIELD_CHECK_MASK' 90 | __BF_FIELD_CHECK_MASK(mask, val, pfx); \ | ^~~~~~~~~~~~~~~~~~~~~ ./include/linux/bitfield.h:137:17: note: in expansion of macro '__FIELD_PREP' 137 | __FIELD_PREP(_mask, _val, "FIELD_PREP: "); \ | ^~~~~~~~~~~~ drivers/iommu/generic_pt/fmt/amdv1.h:220:26: note: in expansion of macro 'FIELD_PREP' 220 | FIELD_PREP(AMDV1PT_FMT_OA, | ^~~~~~~~~~ Changing the caller to check pts.level == 0 avoids demanding a bit of complex reasoning from the compiler that pts.level == level == 0. Instead the compiler sees that pt_install_leaf_entry() is called with a constant pts.level == 0 which makes it more reliable to see the constant false in the if. Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op") Reported-by: Chunyu Hu <chuhu@redhat.com> Closes: https://lore.kernel.org/all/aUn9uGPCooqB-RIF@gmail.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-19iommupt: Do not set C-bit on MMIO backed PTEsWei Wang
AMD Secure Memory Encryption (SME) marks individual memory pages as encrypted by setting the C-bit in page table entries. According to the AMD APM,any pages corresponding to MMIO addresses must be configured with the C-bit clear. The current *_iommu_set_prot() implementation sets the C-bit on all PTEs in the IOMMU page tables. This is incorrect for PTEs backed by MMIO, and can break PCIe peer-to-peer communication when IOVA is used. Fix this by avoiding the C-bit for MMIO-backed mappings. For amdv2 IOMMU page tables, there is a usage scenario for GVA->GPA mappings, and for the trusted MMIO in the TEE-IO case, the C-bit will need to be added to GPA. However, SNP guests do not yet support vIOMMU, and the trusted MMIO support is not ready in upstream. Adding the C-bit for trusted MMIO can be considered once those features land. Fixes: 879ced2bab1b ("iommupt: Add the AMD IOMMU v1 page table format") Fixes: aef5de756ea8 ("iommupt: Add the x86 64 bit page table format") Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Wei Wang <wei.w.wang@hotmail.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2026-01-10iommupt: Make pt_feature() always_inlineJason Gunthorpe
gcc 8.5 on powerpc does not automatically inline these functions even though they evaluate to constants in key cases. Since the constant propagation is essential for some code elimination and built-time checks this causes a build failure: ERROR: modpost: "__pt_no_sw_bit" [drivers/iommu/generic_pt/fmt/iommu_amdv1.ko] undefined! Caused by this: if (pts_feature(&pts, PT_FEAT_DMA_INCOHERENT) && !pt_test_sw_bit_acquire(&pts, SW_BIT_CACHE_FLUSH_DONE)) flush_writes_item(&pts); Where pts_feature() evaluates to a constant false. Mark them as __always_inline to force it to evaluate to a constant and trigger the code elimination. Fixes: 7c5b184db714 ("genpt: Generic Page Table base API") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512230720.9y9DtWIo-lkp@intel.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-01-10iommupt: Fix the kunit buildingJason Gunthorpe
The kunit doesn't work since the below commit made GENERIC_PT unselectable: $ make ARCH=x86_64 O=build_kunit_x86_64 olddefconfig ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config. This is probably due to unsatisfied dependencies. Missing: CONFIG_DEBUG_GENERIC_PT=y, CONFIG_IOMMUFD_TEST=y, CONFIG_IOMMU_PT_X86_64=y, CONFIG_GENERIC_PT=y, CONFIG_IOMMU_PT_AMDV1=y, CONFIG_IOMMU_PT_VTDSS=y, CONFIG_IOMMU_PT=y, CONFIG_IOMMU_PT_KUNIT_TEST=y Also remove the unneeded CONFIG_IOMMUFD_TEST reference as the iommupt kunit doesn't interact with iommufd, and it doesn't currently build for the kunit due problems with DMA_SHARED buffer either. Fixes: 01569c216dde ("genpt: Make GENERIC_PT invisible") Fixes: 1dd4187f53c3 ("iommupt: Add a kunit test for Generic Page Table") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-12-18iommupt: Return ERR_PTR from _table_alloc()Jason Gunthorpe
syzkaller noticed that with fault injection a failure inside iommu_alloc_pages_node_sz() oops's in PT_FEAT_DMA_INCOHERENT because it goes on to make NULL incoherent. Closer inspection shows the return value has become confused, the alloc routines on the iommupt side expect ERR_PTR while iommu_alloc_pages_node_sz() returns NULL. Error out early to fix both issues. Fixes: aefd967dab64 ("iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENT") Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op") Fixes: cdb39d918579 ("iommupt: Add the basic structure of the iommu implementation") Reported-by: syzbot+e06bb7478e687f235ad7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/693a39de.050a0220.4004e.02ce.GAE@google.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28iommupt/vtd: Support mgaw's less than a 4 level walk for first stageJason Gunthorpe
If the IOVA is limited to less than 48 the page table will be constructed with a 3 level configuration which is unsupported by hardware. Like the second stage the caller needs to pass in both the top_level an the vasz to specify a table that has more levels than required to hold the IOVA range. Fixes: 6cbc09b7719e ("iommu/vt-d: Restore previous domain::aperture_end calculation") Reported-by: Calvin Owens <calvin@wbinvd.org> Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28iommupt/vtd: Allow VT-d to have a larger table top than the vasz requiresJason Gunthorpe
VT-d second stage HW specifies both the maximum IOVA and the supported table walk starting points. Weirdly there is HW that only supports a 4 level walk but has a maximum IOVA that only needs 3. The current code miscalculates this and creates a wrongly sized page table which ultimately fails the compatibility check for number of levels. This is fixed by allowing the page table to be created with both a vasz and top_level input. The vasz will set the aperture for the domain while the top_level will set the page table geometry. Add top_level to vtdss and correct the logic in VT-d to generate the right top_level and vasz from mgaw and sagaw. Fixes: d373449d8e97 ("iommu/vt-d: Use the generic iommu page table") Reported-by: Calvin Owens <calvin@wbinvd.org> Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28genpt: Make GENERIC_PT invisibleGeert Uytterhoeven
There is no point in asking the user about the Generic Radix Page Table API: - All IOMMU drivers that use this API already select GENERIC_PT when needed, - Most users probably do not know what to answer anyway. Fixes: 7c5b184db7145fd4 ("genpt: Generic Page Table base API") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-27iommupt: Avoid a compiler bug with sw_bitJason Gunthorpe
gcc 13, in some cases, gets confused if the __builtin_constant_p() is inside the switch. It thinks that bitnr can have the value max+1 and fails. Lift the check outside the switch to avoid it. Fixes: ef7bfe5bbffd ("iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT") Fixes: 5448c1558f60 ("iommupt: Add the Intel VT-d second stage page table format") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511242012.I7g504Ab-lkp@intel.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-25iommupt: Fix unlikely flows in increase_top()Jason Gunthorpe
Since increase_top() does it's own READ_ONCE() on top_of_table, the caller's prior READ_ONCE() could be inconsistent and the first time through the loop we may actually already have the right level if two threads are racing map. In this case new_level will be left uninitialized. Further all the exits from the loop have to either commit to the new top or free any memory allocated so the early return must be a goto err_free. Make it so the only break from the loop always sets new_level to the right value and all other exits go to err_free. Use pts.level (the pts represents the top we are stacking) within the loop instead of new_level. Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aRwgNW9PiW2j-Qwo@stanley.mountain Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17iommupt: Actually correct pt_test_sw_bit_{acquire_release}() parameter ↵Bagas Sanjaya
description In review comment for v1 of genpt documentation fixes [1], Randy suggested that pt_test_sw_bit_acquire() parameters description should be written using "to read". Commit e4dfaf25df1210 ("iommupt: Describe @bitnr parameter"), however, misunderstood the review by instead using "to read" on @bitnr parameter on both pt_test_sw_bit_acquire() and pt_test_sw_bit_release(). Actually correct the description. [1]: https://lore.kernel.org/linux-doc/9dba0eb7-6f32-41b7-b70b-12379364585f@infradead.org/ Fixes: e4dfaf25df1210 ("iommupt: Describe @bitnr parameter") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07iommu/iommupt: Fix build error in genericpt unit-testsJoerg Roedel
Fix the include of iommu-pages.h in the KUnit tests for the IOMMU generic page-table code to make the tests compile again. Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Closes: https://lore.kernel.org/all/9844d4cb-f517-478b-9911-b6dc1a963b8e@leemhuis.info/ Reported-by: "Longia, Amandeep Kaur" <amandeepkaur.longia@amd.com> Closes: https://lore.kernel.org/all/e641c955-25ad-4eae-b3fe-4392966cf768@amd.com/ Fixes: 1dd4187f53c3 ("iommupt: Add a kunit test for Generic Page Table") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07iommupt: Documentation fixesJason Gunthorpe
Some adjustments pointed out by Randy: "decodes an full 64-bit" -> "decodes the full 64 bit" Correct the function parameter name for iova_to_phys() Use the recommended section heading style. Suggested-by: Randy Dunlap <rdunlap@infradead.org> Fixes: ab0b572847ac ("genpt: Add Documentation/ files") Fixes: 879ced2bab1b ("iommupt: Add the AMD IOMMU v1 page table format") Fixes: 9d4c274cd7d5 ("iommupt: Add iova_to_phys op") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07iommupt: Describe @bitnr parameterBagas Sanjaya
Sphinx reports kernel-doc warnings when making htmldocs: WARNING: ./drivers/iommu/generic_pt/pt_common.h:361 function parameter 'bitnr' not described in 'pt_test_sw_bit_acquire' WARNING: ./drivers/iommu/generic_pt/pt_common.h:371 function parameter 'bitnr' not described in 'pt_set_sw_bit_release' Describe @bitnr to squash them. Fixes: bcc64b57b48e ("iommupt: Add basic support for SW bits in the page table") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add a kunit test for the SW bitsJason Gunthorpe
Add some basic checks that the sw_bit APIs work as expected. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENTJason Gunthorpe
VT-d requires PT_FEAT_DMA_INCOHERENT for the x86 page table as well, implement the required SW bits and enable the feature. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt/x86: Set the dirty bit only for writable PTEsJason Gunthorpe
AMD and VTD are historically different here, adopt the VTD version of setting the D bit only on writable PTEs as it makes more sense. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add the Intel VT-d second stage page table formatJason Gunthorpe
The VT-d second stage format is almost the same as the x86 PAE format, except the bit encodings in the PTE are different and a few new PTE features, like force coherency are present. Among all the formats it is unique in not having a designated present bit. Comparing the performance of several operations to the existing version: iommu_map() pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 53,66 , 50,64 , 21.21 2^21, 59,70 , 56,67 , 16.16 2^30, 54,66 , 52,63 , 17.17 256*2^12, 384,524 , 337,516 , 34.34 256*2^21, 387,632 , 336,626 , 46.46 256*2^30, 376,629 , 323,623 , 48.48 iommu_unmap() pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 67,86 , 63,84 , 25.25 2^21, 64,84 , 59,80 , 26.26 2^30, 59,78 , 56,74 , 24.24 256*2^12, 216,335 , 198,317 , 37.37 256*2^21, 245,350 , 232,344 , 32.32 256*2^30, 248,345 , 226,339 , 33.33 Cc: Tina Zhang <tina.zhang@intel.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Flush the CPU cache after any writes to the page tableJason Gunthorpe
Flush the CPU cache for the page table memory after each set of writes to the page table. The iommu should have visibility to the updated entries as soon as the map/unmap/etc operations return, like normal coherent hardware does. The caches also have to be flushed before any gather can be submitted to the driver. Implement the same solution to the race as io-pgtable-arm by using a software PTE bit to track if a table entry has been flushed or not. If another thread is still flushing then another concurrent map operation could return without IOMMU visibility to a required table entry. The SW bit will tell the second thread to also flush the cache. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENTJason Gunthorpe
This is the first step to supporting an incoherent walker, start and stop the incoherence around the allocation and frees of the page table memory. The iommu_pages API maps this to dma_map/unmap_single(), or arch cache flushing calls. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add basic support for SW bits in the page tableJason Gunthorpe
SW bits can be placed on items, including table entries, single OA's and individual items within a contiguous OA. They are guaranteed to be ignored by the HW. The API is very basic since the only use case so far is a single bit. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add a kunit test for the IOMMU implementationJason Gunthorpe
This intends to have high coverage of the page table format functions and the IOMMU implementation itself, exercising the various corner cases. The kunit tests can be run in the kunit framework, using commands like: tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y There are several interesting corner cases on the 32 bit platforms that need checking. Like the generic tests, these are run on the format's configuration list using kunit "params". This also checks the core iommu parts of the page table code as it enters the logic through a mock iommu_domain. The following are checked: - PT_FEAT_DYNAMIC_TOP properly adds levels one by one - Every page size can be iommu_map()'d, and mapping creates that size - iommu_iova_to_phys() works with every page size - Test converting OA -> non present -> OA when the two OAs overlap and free table levels - Test that unmap stops at holes, unmap doesn't split, and unmap returns the right values for partial unmap requests - Randomly map/unmap. Checks map with random sizes, that map fails when hitting collisions doing nothing, unmap/map with random intersections and full unmap of random sizes. Also checks iommu_iova_to_phys() with random sizes - Check for memory leaks by monitoring NR_SECONDARY_PAGETABLE Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add the x86 64 bit page table formatJason Gunthorpe
This is used by x86 CPUs and can be used in AMD/VT-d x86 IOMMUs. When a x86 IOMMU is running SVA the MM will be using this format. This implementation follows the AMD v2 io-pgtable version. There is nothing remarkable here, the format can have 4 or 5 levels and limited support for different page sizes. No contiguous pages support. x86 uses a sign extension mechanism where the top bits of the VA must match the sign bit. The core code supports this through PT_FEAT_SIGN_EXTEND which creates and upper and lower VA range. All the new operations will work correctly in both spaces, however currently there is no way to report the upper space to other layers. Future patches can improve that. In principle this can support 3 page tables levels matching the 32 bit PAE table format, but no iommu driver needs this. The focus is on the modern 64 bit 4 and 5 level formats. Comparing the performance of several operations to the existing version: iommu_map() pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 71,61 , 66,58 , -13.13 2^21, 66,60 , 61,55 , -10.10 2^30, 59,56 , 56,54 , -3.03 256*2^12, 392,1360 , 345,1289 , 73.73 256*2^21, 383,1159 , 335,1145 , 70.70 256*2^30, 378,965 , 331,892 , 62.62 iommu_unmap() pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 77,71 , 73,68 , -7.07 2^21, 76,70 , 70,66 , -6.06 2^30, 69,66 , 66,63 , -4.04 256*2^12, 225,899 , 210,870 , 75.75 256*2^21, 262,722 , 248,710 , 65.65 256*2^30, 251,643 , 244,634 , 61.61 The small -ve values in the iommu_unmap() are due to the core code calling iommu_pgsize() before invoking the domain op. This is unncessary with this implementation. Future work optimizes this and gets to 2%, 4%, 3%. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommufd: Change the selftest to use iommupt instead of xarrayJason Gunthorpe
The iommufd self test uses an xarray to store the pfns and their orders to emulate a page table. Make it act more like a real iommu driver by replacing the xarray with an iommupt based page table. The new AMDv1 mock format behaves similarly to the xarray. Add set_dirty() as a iommu_pt operation to allow the test suite to simulate HW dirty. Userspace can select between several formats including the normal AMDv1 format and a special MOCK_IOMMUPT_HUGE variation for testing huge page dirty tracking. To make the dirty tracking test work the page table must only store exactly 2M huge pages otherwise the logic the test uses fails. They cannot be broken up or combined. Aside from aligning the selftest with a real page table implementation, this helps test the iommupt code itself. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add a mock pagetable format for iommufd selftest to useJason Gunthorpe
The iommufd self test uses an xarray to store the pfns and their orders to emulate a page table. Slightly modify the amdv1 page table to create a real page table that has similar properties: - 2k base granule to simulate something like a 4k page table on a 64K PAGE_SIZE ARM system - Contiguous page support for every PFN order - Dirty tracking AMDv1 is the closest format, as it is the only one that already supports every page size. Tweak it to have only 5 levels and an 11 bit base granule and compile it separately as a format variant. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add a kunit test for Generic Page TableJason Gunthorpe
This intends to have high coverage of the page table format functions, it uses the IOMMU implementation to create a tree which it then walks through and directly calls the generic page table functions to test them. It is a good starting point to test a new format header as it is often able to find typos and inconsistencies much more directly, rather than with an obscure failure in the iommu implementation. The tests can be run with commands like: tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_WERROR=n tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y There are several interesting corner cases on the 32 bit platforms that need checking. The format can declare a list of configurations that generate different configurations the initialize the page table, for instance with different top levels or other parameters. The kunit will turn these into "params" which cause each test to run multiple times. The tests are repeated to run at every table level to check that all the item encoding formats work. The following are checked: - Basic init works for each configuration - The various log2 functions have the expected behavior at the limits - pt_compute_best_pgsize() works - pt_table_pa() reads back what pt_install_table() writes - range.max_vasz_lg2 works properly - pt_table_oa_lg2sz() and pt_table_item_lg2sz() use a contiguous non-overlapping set of bits from the VA up to the defined max_va - pt_possible_sizes() and pt_can_have_leaf() produces a sensible layout - pt_item_oa(), pt_entry_oa(), and pt_entry_num_contig_lg2() read back what pt_install_leaf_entry() writes - pt_clear_entry() works - pt_attr_from_entry() reads back what pt_iommu_set_prot() & pt_install_leaf_entry() writes - pt_entry_set_write_clean(), pt_entry_make_write_dirty(), and pt_entry_write_is_dirty() work Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add read_and_clear_dirty opJason Gunthorpe
IOMMU HW now supports updating a dirty bit in an entry when a DMA writes to the entry's VA range. iommufd has a uAPI to read and clear the dirty bits from the tables. This is a trivial recursive descent algorithm to read and optionally clear the dirty bits. The format needs a function to tell if a contiguous entry is dirty, and a function to clear a contiguous entry back to clean. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add map_pages opJason Gunthorpe
map is slightly complicated because it has to handle a number of special edge cases: - Overmapping a previously shared, but now empty, table level with an OA. Requries validating and freeing the possibly empty tables - Doing the above across an entire to-be-created contiguous entry - Installing a new shared table level concurrently with another thread - Expanding the table by adding more top levels Table expansion is a unique feature of AMDv1, this version is quite similar except we handle racing concurrent lockless map. The table top pointer and starting level are encoded in a single uintptr_t which ensures we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use that fixed point as its starting point. Concurrent expansion is handled with a table global spinlock. When inserting a new table entry map checks that the entire portion of the table is empty. This includes freeing any empty lower tables that will be overwritten by an OA. A separate free list is used while checking and collecting all the empty lower tables so that writing the new entry is uninterrupted, either the new entry fully writes or nothing changes. A special fast path for PAGE_SIZE is implemented that does a direct walk to the leaf level and installs a single entry. This gives ~15% improvement for iommu_map() when mapping lists of single pages. This version sits under the iommu_domain_ops as map_pages() but does not require the external page size calculation. The implementation is actually map_range() and can do arbitrary ranges, internally handling all the validation and supporting any arrangment of page sizes. A future series can optimize iommu_map() to take advantage of this. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05iommupt: Add unmap_pages opJason Gunthorpe
unmap_pages removes mappings and any fully contained interior tables from the given range. This follows the now-standard iommu_domain API definition where it does not split up larger page sizes into smaller. The caller must perform unmap only on ranges created by map or it must have somehow otherwise determined safe cut points (eg iommufd/vfio use iova_to_phys to scan for them) A future work will provide 'cut' which explicitly does the page size split if the HW can support it. unmap is implemented with a recursive descent of the tree. If the caller provides a VA range that spans an entire table item then the table memory can be freed as well. If an entire table item can be freed then this version will also check the leaf-only level of the tree to ensure that all entries are present to generate -EINVAL. Many of the existing drivers don't do this extra check. This version sits under the iommu_domain_ops as unmap_pages() but does not require the external page size calculation. The implementation is actually unmap_range() and can do arbitrary ranges, internally handling all the validation and supporting any arrangment of page sizes. A future series can optimize __iommu_unmap() to take advantage of this. Freed page table memory is batched up in the gather and will be freed in the driver's iotlb_sync() callback after the IOTLB flush completes. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>