|
The cache flush was happening on every level across the whole range of
iteration, even if no leaves or tables were cleared. Instead flush only the
sub range that was actually written.
Overflushing isn't a correctness problem but it does impact the
performance of unmap.
After this series the performance compared to the original VT-d
implementation with cache flushing turned on is:
map_pages
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 253,266 , 213,227 , 6.06
2^21, 246,244 , 221,219 , 0.00
2^30, 231,240 , 209,217 , 3.03
256*2^12, 2604,2668 , 2415,2540 , 4.04
256*2^21, 2495,2824 , 2390,2734 , 12.12
256*2^30, 2542,2845 , 2380,2718 , 12.12
unmap_pages
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 259,292 , 222,251 , 11.11
2^21, 255,259 , 227,236 , 3.03
2^30, 238,254 , 217,230 , 5.05
256*2^12, 2751,2620 , 2417,2437 , 0.00
256*2^21, 2461,2526 , 2377,2423 , 1.01
256*2^30, 2498,2543 , 2370,2404 , 1.01
Fixes: efa03dab7ce4 ("iommupt: Flush the CPU cache after any writes to the page table")
Reported-by: Francois Dugast <francois.dugast@intel.com>
Closes: https://lore.kernel.org/all/20260121130233.257428-1-francois.dugast@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Older versions of gcc and clang sometimes get tripped up by the build time
assertion in FIELD_PREP because they can see that the argument to
FIELD_PREP is constant but can't see that the if condition protecting it
is also a constant false.
In file included from <command-line>:
In function 'amdv1pt_install_leaf_entry',
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:651:3,
inlined from '__map_single_page0' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:662:1,
inlined from 'pt_descend' at drivers/iommu/generic_pt/fmt/../pt_iter.h:391:9,
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:658:10,
inlined from '__map_single_page1.constprop' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:662:1:
././include/linux/compiler_types.h:631:45: error: call to '__compiletime_assert_251' declared with attribute error: FIELD_PREP: value too large for the field
631 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
././include/linux/compiler_types.h:612:25: note: in definition of macro '__compiletime_assert'
612 | prefix ## suffix(); \
| ^~~~~~
././include/linux/compiler_types.h:631:9: note: in expansion of macro '_compiletime_assert'
631 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
./include/linux/bitfield.h:69:17: note: in expansion of macro 'BUILD_BUG_ON_MSG'
69 | BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ? \
| ^~~~~~~~~~~~~~~~
./include/linux/bitfield.h:90:17: note: in expansion of macro '__BF_FIELD_CHECK_MASK'
90 | __BF_FIELD_CHECK_MASK(mask, val, pfx); \
| ^~~~~~~~~~~~~~~~~~~~~
./include/linux/bitfield.h:137:17: note: in expansion of macro '__FIELD_PREP'
137 | __FIELD_PREP(_mask, _val, "FIELD_PREP: "); \
| ^~~~~~~~~~~~
drivers/iommu/generic_pt/fmt/amdv1.h:220:26: note: in expansion of macro 'FIELD_PREP'
220 | FIELD_PREP(AMDV1PT_FMT_OA,
| ^~~~~~~~~~
Changing the caller to check pts.level == 0 avoids demanding a bit of
complex reasoning from the compiler, namely that pts.level == level == 0.
Instead the compiler sees that pt_install_leaf_entry() is called with a
constant pts.level == 0, which makes it more reliable at seeing the
constant false in the if.
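A minimal sketch of the pattern (the struct and helper names here are
illustrative, not the real iommu_pt code):
struct walk_state { unsigned int level; };

static inline void install_leaf(struct walk_state *pts)
{
        /* A FIELD_PREP() sits behind this test; with pts->level known to
         * be the constant 0 the protected branch folds away. */
        if (pts->level != 0)
                return;
        /* ... write the leaf PTE ... */
}

static void map_one(struct walk_state *pts, unsigned int level)
{
        /*
         * Check the expression the callee branches on (pts->level) rather
         * than the separate 'level' local, so the compiler does not have
         * to prove pts->level == level == 0 after inlining.
         */
        if (pts->level == 0)
                install_leaf(pts);
}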
Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op")
Reported-by: Chunyu Hu <chuhu@redhat.com>
Closes: https://lore.kernel.org/all/aUn9uGPCooqB-RIF@gmail.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
gcc 8.5 on powerpc does not automatically inline these functions even
though they evaluate to constants in key cases. Since the constant
propagation is essential for some code elimination and build-time checks,
this causes a build failure:
ERROR: modpost: "__pt_no_sw_bit" [drivers/iommu/generic_pt/fmt/iommu_amdv1.ko] undefined!
Caused by this:
if (pts_feature(&pts, PT_FEAT_DMA_INCOHERENT) &&
!pt_test_sw_bit_acquire(&pts,
SW_BIT_CACHE_FLUSH_DONE))
flush_writes_item(&pts);
Here pts_feature() evaluates to a constant false. Mark these helpers as
__always_inline to force them to evaluate to constants and trigger the
code elimination.
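A hedged sketch of the mechanism (__pt_no_sw_bit is the real undefined
symbol; the helper bodies here are simplified stand-ins):
extern void __pt_no_sw_bit(void);       /* intentionally never defined */

static __always_inline bool pts_has_feature(unsigned int feat)
{
        return false;                   /* constant for this format */
}

static __always_inline void check_flush_done(void)
{
        /* Without __always_inline, gcc 8.5 on powerpc may emit the feature
         * test as a real call, the branch survives, and the link fails on
         * the deliberately undefined __pt_no_sw_bit(). */
        if (pts_has_feature(0))
                __pt_no_sw_bit();
}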
Fixes: 7c5b184db714 ("genpt: Generic Page Table base API")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512230720.9y9DtWIo-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The kunit test doesn't work since the commit below made GENERIC_PT
unselectable:
$ make ARCH=x86_64 O=build_kunit_x86_64 olddefconfig
ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config.
This is probably due to unsatisfied dependencies.
Missing: CONFIG_DEBUG_GENERIC_PT=y, CONFIG_IOMMUFD_TEST=y,
CONFIG_IOMMU_PT_X86_64=y, CONFIG_GENERIC_PT=y, CONFIG_IOMMU_PT_AMDV1=y,
CONFIG_IOMMU_PT_VTDSS=y, CONFIG_IOMMU_PT=y, CONFIG_IOMMU_PT_KUNIT_TEST=y
Also remove the unneeded CONFIG_IOMMUFD_TEST reference as the iommupt kunit
doesn't interact with iommufd, and it doesn't currently build for the
kunit due to problems with DMA_SHARED buffer either.
Fixes: 01569c216dde ("genpt: Make GENERIC_PT invisible")
Fixes: 1dd4187f53c3 ("iommupt: Add a kunit test for Generic Page Table")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
syzkaller noticed that with fault injection a failure inside
iommu_alloc_pages_node_sz() oopses in PT_FEAT_DMA_INCOHERENT because it goes
on to make NULL incoherent. Closer inspection shows the return value has
become confused: the alloc routines on the iommupt side expect an ERR_PTR
while iommu_alloc_pages_node_sz() returns NULL.
Error out early to fix both issues.
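A hedged sketch of the shape of the fix (the argument list and the
surrounding iommupt structure are assumed):
static void *table_alloc(int nid, gfp_t gfp, size_t sz)
{
        void *mem = iommu_alloc_pages_node_sz(nid, gfp, sz);

        /* The page allocator reports failure as NULL but iommupt callers
         * expect ERR_PTR, so translate before anything else runs. */
        if (!mem)
                return ERR_PTR(-ENOMEM);

        /* Only a successful allocation reaches the PT_FEAT_DMA_INCOHERENT
         * start path, so NULL is never handed to the cache flushing. */
        return mem;
}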
Fixes: aefd967dab64 ("iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENT")
Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op")
Fixes: cdb39d918579 ("iommupt: Add the basic structure of the iommu implementation")
Reported-by: syzbot+e06bb7478e687f235ad7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/693a39de.050a0220.4004e.02ce.GAE@google.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
If the IOVA is limited to less than 48 bits the page table will be
constructed with a 3 level configuration, which is unsupported by the
hardware.
Like the second stage, the caller needs to pass in both the top_level and
the vasz to specify a table that has more levels than required to hold the
IOVA range.
Fixes: 6cbc09b7719e ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
VT-d second stage HW specifies both the maximum IOVA and the supported
table walk starting points. Weirdly there is HW that only supports a 4
level walk but has a maximum IOVA that only needs 3.
The current code miscalculates this and creates a wrongly sized page table
which ultimately fails the compatibility check for number of levels.
This is fixed by allowing the page table to be created with both a vasz
and top_level input. The vasz will set the aperture for the domain while
the top_level will set the page table geometry.
Add top_level to vtdss and correct the logic in VT-d to generate the right
top_level and vasz from mgaw and sagaw.
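A small sketch of the resulting split between the two inputs (names are
illustrative, not the actual vtdss configuration struct):
struct pt_init_cfg {
        unsigned int vasz_lg2;          /* aperture: limits aperture_end  */
        unsigned int top_level;         /* geometry: fixes the walk depth */
};

/*
 * e.g. HW whose sagaw only allows a 4 level walk while mgaw only needs a
 * 39 bit IOVA: the table keeps the deeper geometry, the domain keeps the
 * smaller aperture.
 *
 *      cfg.top_level = 3;              4 level table
 *      cfg.vasz_lg2  = 39;             aperture smaller than the table
 */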
Fixes: d373449d8e97 ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
There is no point in asking the user about the Generic Radix Page
Table API:
- All IOMMU drivers that use this API already select GENERIC_PT when
needed,
- Most users probably do not know what to answer anyway.
Fixes: 7c5b184db7145fd4 ("genpt: Generic Page Table base API")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
gcc 13, in some cases, gets confused if the __builtin_constant_p() is
inside the switch. It thinks that bitnr can have the value max+1 and
fails. Lift the check outside the switch to avoid it.
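Roughly the shape of the change, as a guess at the pattern rather than the
actual vtdss/x86 helpers:
static __always_inline u64 sw_bit_to_pte_bit(unsigned int bitnr)
{
        /* Lifted out of the switch: when this sat inside a case, gcc 13
         * could conclude bitnr might hold a value past the last case and
         * tripped the compile time check. */
        BUILD_BUG_ON(!__builtin_constant_p(bitnr));

        switch (bitnr) {
        case 0:
                return BIT_ULL(52);     /* bit positions are made up */
        case 1:
                return BIT_ULL(53);
        default:
                return 0;
        }
}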
Fixes: ef7bfe5bbffd ("iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT")
Fixes: 5448c1558f60 ("iommupt: Add the Intel VT-d second stage page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511242012.I7g504Ab-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Since increase_top() does its own READ_ONCE() on top_of_table, the
caller's prior READ_ONCE() could be inconsistent, and the first time
through the loop we may actually already have the right level if two
threads are racing map.
In this case new_level will be left uninitialized.
Further, all the exits from the loop have to either commit to the new top
or free any memory allocated, so the early return must become a goto
err_free.
Make it so the only break from the loop always sets new_level to the right
value and all other exits go to err_free. Use pts.level (the pts
represents the top we are stacking) within the loop instead of new_level.
Fixes: dcd6a011a8d5 ("iommupt: Add map_pages op")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aRwgNW9PiW2j-Qwo@stanley.mountain
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In a review comment on v1 of the genpt documentation fixes [1], Randy
suggested that the pt_test_sw_bit_acquire() parameter description
should be written using "to read". Commit e4dfaf25df1210 ("iommupt:
Describe @bitnr parameter"), however, misunderstood the review and
instead used "to read" for the @bitnr parameter of both
pt_test_sw_bit_acquire() and pt_test_sw_bit_release().
Actually correct the description.
[1]: https://lore.kernel.org/linux-doc/9dba0eb7-6f32-41b7-b70b-12379364585f@infradead.org/
Fixes: e4dfaf25df1210 ("iommupt: Describe @bitnr parameter")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Fix the include of iommu-pages.h in the KUnit tests for the IOMMU
generic page-table code to make the tests compile again.
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/all/9844d4cb-f517-478b-9911-b6dc1a963b8e@leemhuis.info/
Reported-by: "Longia, Amandeep Kaur" <amandeepkaur.longia@amd.com>
Closes: https://lore.kernel.org/all/e641c955-25ad-4eae-b3fe-4392966cf768@amd.com/
Fixes: 1dd4187f53c3 ("iommupt: Add a kunit test for Generic Page Table")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Some adjustments pointed out by Randy:
"decodes an full 64-bit" -> "decodes the full 64 bit"
Correct the function parameter name for iova_to_phys()
Use the recommended section heading style.
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: ab0b572847ac ("genpt: Add Documentation/ files")
Fixes: 879ced2bab1b ("iommupt: Add the AMD IOMMU v1 page table format")
Fixes: 9d4c274cd7d5 ("iommupt: Add iova_to_phys op")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Sphinx reports kernel-doc warnings when making htmldocs:
WARNING: ./drivers/iommu/generic_pt/pt_common.h:361 function parameter 'bitnr' not described in 'pt_test_sw_bit_acquire'
WARNING: ./drivers/iommu/generic_pt/pt_common.h:371 function parameter 'bitnr' not described in 'pt_set_sw_bit_release'
Describe @bitnr to squash them.
Fixes: bcc64b57b48e ("iommupt: Add basic support for SW bits in the page table")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Add some basic checks that the sw_bit APIs work as expected.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
VT-d requires PT_FEAT_DMA_INCOHERENT for the x86 page table as well;
implement the required SW bits and enable the feature.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
AMD and VTD are historically different here; adopt the VTD version of
setting the D bit only on writable PTEs as it makes more sense.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The VT-d second stage format is almost the same as the x86 PAE format,
except the bit encodings in the PTE are different and a few new PTE
features, like force coherency, are present.
Among all the formats it is unique in not having a designated present bit.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 50,64 , 21.21
2^21, 59,70 , 56,67 , 16.16
2^30, 54,66 , 52,63 , 17.17
256*2^12, 384,524 , 337,516 , 34.34
256*2^21, 387,632 , 336,626 , 46.46
256*2^30, 376,629 , 323,623 , 48.48
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 67,86 , 63,84 , 25.25
2^21, 64,84 , 59,80 , 26.26
2^30, 59,78 , 56,74 , 24.24
256*2^12, 216,335 , 198,317 , 37.37
256*2^21, 245,350 , 232,344 , 32.32
256*2^30, 248,345 , 226,339 , 33.33
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Flush the CPU cache for the page table memory after each set of writes to
the page table. The iommu should have visibility to the updated entries as
soon as the map/unmap/etc operations return, like normal coherent hardware
does.
The caches also have to be flushed before any gather can be submitted to
the driver.
Implement the same solution to the race as io-pgtable-arm by using a
software PTE bit to track if a table entry has been flushed or not. If
another thread is still flushing then another concurrent map operation
could return without IOMMU visibility to a required table entry. The SW
bit will tell the second thread to also flush the cache.
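The pattern, sketched with the helper names this series adds (the struct
name and the placement of the check are assumed):
static void flush_new_table_entry(struct pt_state *pts)
{
        /*
         * A racing thread may have installed this entry but not finished
         * flushing it. SW_BIT_CACHE_FLUSH_DONE is only set, with release
         * semantics, after that flush completes, so if it is not observed
         * here this thread must flush too before its map can return.
         */
        if (pts_feature(pts, PT_FEAT_DMA_INCOHERENT) &&
            !pt_test_sw_bit_acquire(pts, SW_BIT_CACHE_FLUSH_DONE))
                flush_writes_item(pts);
}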
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This is the first step to supporting an incoherent walker, start and stop
the incoherence around the allocation and frees of the page table memory.
The iommu_pages API maps this to dma_map/unmap_single(), or arch cache
flushing calls.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
SW bits can be placed on items, including table entries, single OAs and
individual items within a contiguous OA. They are guaranteed to be ignored
by the HW. The API is very basic since the only use case so far is a
single bit.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This intends to have high coverage of the page table format functions and
the IOMMU implementation itself, exercising the various corner cases.
The kunit tests can be run in the kunit framework, using commands like:
tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y
There are several interesting corner cases on the 32 bit platforms that
need checking.
Like the generic tests, these are run on the format's configuration list
using kunit "params". This also checks the core iommu parts of the page
table code as it enters the logic through a mock iommu_domain.
The following are checked:
- PT_FEAT_DYNAMIC_TOP properly adds levels one by one
- Every page size can be iommu_map()'d, and mapping creates that size
- iommu_iova_to_phys() works with every page size
- Test converting OA -> non present -> OA when the two OAs overlap and
free table levels
- Test that unmap stops at holes, unmap doesn't split, and unmap returns
the right values for partial unmap requests
- Randomly map/unmap. Checks map with random sizes, that map fails (doing
nothing) when hitting collisions, unmap/map with random intersections, and
full unmap of random sizes. Also checks iommu_iova_to_phys() with random
sizes
- Check for memory leaks by monitoring NR_SECONDARY_PAGETABLE
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This is used by x86 CPUs and can be used in AMD/VT-d x86 IOMMUs. When an
x86 IOMMU is running SVA the MM will be using this format.
This implementation follows the AMD v2 io-pgtable version.
There is nothing remarkable here, the format can have 4 or 5 levels and
limited support for different page sizes. No contiguous pages support.
x86 uses a sign extension mechanism where the top bits of the VA must
match the sign bit. The core code supports this through
PT_FEAT_SIGN_EXTEND which creates an upper and a lower VA range. All the
new operations will work correctly in both spaces, however currently there
is no way to report the upper space to other layers. Future patches can
improve that.
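For reference, sign extension here means canonical addressing; a hedged,
stand-alone check (not the actual PT_FEAT_SIGN_EXTEND plumbing) looks like:
static bool va_is_canonical(u64 va, unsigned int va_lg2sz)
{
        /* Every bit above the va_lg2sz decoded bits must replicate bit
         * (va_lg2sz - 1); shifting up and arithmetic shifting back makes
         * that comparison directly. */
        u64 shifted = va << (64 - va_lg2sz);

        return ((s64)shifted >> (64 - va_lg2sz)) == (s64)va;
}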
In principle this can support 3 page table levels matching the 32 bit PAE
table format, but no iommu driver needs this. The focus is on the modern
64 bit 4 and 5 level formats.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 71,61 , 66,58 , -13.13
2^21, 66,60 , 61,55 , -10.10
2^30, 59,56 , 56,54 , -3.03
256*2^12, 392,1360 , 345,1289 , 73.73
256*2^21, 383,1159 , 335,1145 , 70.70
256*2^30, 378,965 , 331,892 , 62.62
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 77,71 , 73,68 , -7.07
2^21, 76,70 , 70,66 , -6.06
2^30, 69,66 , 66,63 , -4.04
256*2^12, 225,899 , 210,870 , 75.75
256*2^21, 262,722 , 248,710 , 65.65
256*2^30, 251,643 , 244,634 , 61.61
The small -ve values in the iommu_unmap() are due to the core code calling
iommu_pgsize() before invoking the domain op. This is unnecessary with this
implementation. Future work optimizes this and gets to 2%, 4%, 3%.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Make it act more like a real iommu driver by
replacing the xarray with an iommupt based page table. The new AMDv1 mock
format behaves similarly to the xarray.
Add set_dirty() as an iommu_pt operation to allow the test suite to
simulate HW dirty.
Userspace can select between several formats including the normal AMDv1
format and a special MOCK_IOMMUPT_HUGE variation for testing huge page
dirty tracking. To make the dirty tracking test work the page table must
only store exactly 2M huge pages otherwise the logic the test uses
fails. They cannot be broken up or combined.
Aside from aligning the selftest with a real page table implementation,
this helps test the iommupt code itself.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Slightly modify the amdv1 page table to create a
real page table that has similar properties:
- 2k base granule to simulate something like a 4k page table on a 64K
PAGE_SIZE ARM system
- Contiguous page support for every PFN order
- Dirty tracking
AMDv1 is the closest format, as it is the only one that already supports
every page size. Tweak it to have only 5 levels and an 11 bit base granule
and compile it separately as a format variant.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This intends to have high coverage of the page table format functions. It
uses the IOMMU implementation to create a tree, which it then walks while
directly calling the generic page table functions to test them.
It is a good starting point to test a new format header as it is often
able to find typos and inconsistencies much more directly, rather than
with an obscure failure in the iommu implementation.
The tests can be run with commands like:
tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_WERROR=n
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y
There are several interesting corner cases on the 32 bit platforms that
need checking.
The format can declare a list of configurations that initialize the page
table differently, for instance with different top levels or other
parameters. The kunit will turn these into "params"
which cause each test to run multiple times.
The tests are repeated to run at every table level to check that all the
item encoding formats work.
The following are checked:
- Basic init works for each configuration
- The various log2 functions have the expected behavior at the limits
- pt_compute_best_pgsize() works
- pt_table_pa() reads back what pt_install_table() writes
- range.max_vasz_lg2 works properly
- pt_table_oa_lg2sz() and pt_table_item_lg2sz() use a contiguous
non-overlapping set of bits from the VA up to the defined max_va
- pt_possible_sizes() and pt_can_have_leaf() produces a sensible layout
- pt_item_oa(), pt_entry_oa(), and pt_entry_num_contig_lg2() read back
what pt_install_leaf_entry() writes
- pt_clear_entry() works
- pt_attr_from_entry() reads back what pt_iommu_set_prot() &
pt_install_leaf_entry() writes
- pt_entry_set_write_clean(), pt_entry_make_write_dirty(), and
pt_entry_write_is_dirty() work
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
IOMMU HW now supports updating a dirty bit in an entry when a DMA writes
to the entry's VA range. iommufd has a uAPI to read and clear the dirty
bits from the tables.
This is a trivial recursive descent algorithm to read and optionally clear
the dirty bits. The format needs a function to tell if a contiguous entry
is dirty, and a function to clear a contiguous entry back to clean.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
map is slightly complicated because it has to handle a number of special
edge cases:
- Overmapping a previously shared, but now empty, table level with an OA.
Requires validating and freeing the possibly empty tables
- Doing the above across an entire to-be-created contiguous entry
- Installing a new shared table level concurrently with another thread
- Expanding the table by adding more top levels
Table expansion is a unique feature of AMDv1; this version is quite
similar except that it also handles racing concurrent lockless map. The table top
pointer and starting level are encoded in a single uintptr_t which ensures
we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use
that fixed point as its starting point. Concurrent expansion is handled
with a table global spinlock.
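A sketch of the encoding idea (the exact field layout is assumed, not
copied from the real code): table pages are naturally aligned, so the low
pointer bits can carry the starting level and the whole value is read in
one go.
#define TOP_LEVEL_MASK  ((uintptr_t)0x7)

static inline uintptr_t encode_top(void *table, unsigned int level)
{
        return (uintptr_t)table | level;
}

static inline void *top_table(uintptr_t top)
{
        return (void *)(top & ~TOP_LEVEL_MASK);
}

static inline unsigned int decode_top_level(uintptr_t top)
{
        return top & TOP_LEVEL_MASK;
}

/* Every operation starts from one tear-free snapshot:
 *      uintptr_t top = READ_ONCE(common->top_of_table);
 * and walks from top_table(top) at decode_top_level(top).
 */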
When inserting a new table entry map checks that the entire portion of the
table is empty. This includes freeing any empty lower tables that will be
overwritten by an OA. A separate free list is used while checking and
collecting all the empty lower tables so that writing the new entry is
uninterrupted, either the new entry fully writes or nothing changes.
A special fast path for PAGE_SIZE is implemented that does a direct walk
to the leaf level and installs a single entry. This gives ~15% improvement
for iommu_map() when mapping lists of single pages.
This version sits under the iommu_domain_ops as map_pages() but does not
require the external page size calculation. The implementation is actually
map_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize iommu_map() to take advantage of this.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
unmap_pages removes mappings and any fully contained interior tables from
the given range. This follows the now-standard iommu_domain API definition
where it does not split up larger page sizes into smaller. The caller must
perform unmap only on ranges created by map, or it must have somehow
otherwise determined safe cut points (e.g. iommufd/vfio use iova_to_phys to
scan for them).
Future work will provide 'cut', which explicitly does the page size split
if the HW can support it.
unmap is implemented with a recursive descent of the tree. If the caller
provides a VA range that spans an entire table item then the table memory
can be freed as well.
If an entire table item can be freed then this version will also check the
leaf-only level of the tree to ensure that all entries are present to
generate -EINVAL. Many of the existing drivers don't do this extra check.
This version sits under the iommu_domain_ops as unmap_pages() but does not
require the external page size calculation. The implementation is actually
unmap_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize __iommu_unmap() to take advantage of this.
Freed page table memory is batched up in the gather and will be freed in
the driver's iotlb_sync() callback after the IOTLB flush completes.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
iova_to_phys is a performance path for the DMA API and iommufd; implement
it using an unrolled, get_user_pages()-like function waterfall scheme.
The implementation itself is fairly trivial.
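A hedged sketch of what a "waterfall" walk means here (an invented
two-level example with an x86-like PTE layout, not the format-generic
code):
static phys_addr_t lvl0_to_phys(u64 *table, u64 iova)
{
        u64 pte = READ_ONCE(table[(iova >> 12) & 0x1ff]);

        if (!pte)
                return 0;
        return (pte & GENMASK_ULL(51, 12)) | (iova & GENMASK_ULL(11, 0));
}

static phys_addr_t lvl1_to_phys(u64 *table, u64 iova)
{
        u64 pte = READ_ONCE(table[(iova >> 21) & 0x1ff]);

        if (!pte)
                return 0;
        /* Each level is its own small function, so the walk compiles into
         * a chain of direct calls rather than a generic loop. */
        return lvl0_to_phys(__va(pte & GENMASK_ULL(51, 12)), iova);
}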
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
AMD IOMMU v1 is unique in supporting contiguous pages with a variable size
and it can decode the full 64 bit VA space. Unlike other x86 page tables
this explicitly does not do sign extension as part of allowing the entire
64 bit VA space to be supported.
The general design is quite similar to the x86 PAE format, except with a
6th level and quite different PTE encoding.
This format is the only one that uses the PT_FEAT_DYNAMIC_TOP feature in
the existing code as the existing AMDv1 code starts out with a 3 level
table and adds levels on the fly if more IOVA is needed.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 65,64 , 62,61 , -1.01
2^13, 70,66 , 67,62 , -8.08
2^14, 73,69 , 71,65 , -9.09
2^15, 78,75 , 75,71 , -5.05
2^16, 89,89 , 86,84 , -2.02
2^17, 128,121 , 124,112 , -10.10
2^18, 175,175 , 170,163 , -4.04
2^19, 264,306 , 261,279 , 6.06
2^20, 444,525 , 438,489 , 10.10
2^21, 60,62 , 58,59 , 1.01
256*2^12, 381,1833 , 367,1795 , 79.79
256*2^21, 375,1623 , 356,1555 , 77.77
256*2^30, 356,1338 , 349,1277 , 72.72
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 76,89 , 71,86 , 17.17
2^13, 79,89 , 75,86 , 12.12
2^14, 78,90 , 74,86 , 13.13
2^15, 82,89 , 74,86 , 13.13
2^16, 79,89 , 74,86 , 13.13
2^17, 81,89 , 77,87 , 11.11
2^18, 90,92 , 87,89 , 2.02
2^19, 91,93 , 88,90 , 2.02
2^20, 96,95 , 91,92 , 1.01
2^21, 72,88 , 68,85 , 20.20
256*2^12, 372,6583 , 364,6251 , 94.94
256*2^21, 398,6032 , 392,5758 , 93.93
256*2^30, 396,5665 , 389,5258 , 92.92
The ~5-17x speedup when working with multi-PTE map/unmaps is because the
AMD implementation rewalks the entire table on every new PTE while this
version retains its position. The same speedup will be seen with dirtys as
well.
The old implementation triggers a compiler optimization that ends up
generating a "rep stos" memset for contiguous PTEs. Since AMD can have
contiguous PTEs that span 2Kbytes of table this is a huge win compared to
a normal movq loop. It is why the unmap side has a fairly flat runtime as
the contiguous PTE size increases. This version makes it explicit with a
memset64() call.
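A hedged sketch of the explicit form (names illustrative):
static void install_contig_leaf(u64 *entryp, u64 pte, unsigned int num_contig)
{
        /* AMDv1 contiguous mappings repeat the same PTE value; a run can
         * span 2K bytes of table, so the bulk store matters. */
        memset64(entryp, pte, num_contig);
}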
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The existing IOMMU page table implementations duplicate all of the working
algorithms for each format. By using the generic page table API a single C
version of the IOMMU algorithms can be created and re-used for all of the
different formats used in the drivers. The implementation will provide a
single C version of the iommu domain operations: iova_to_phys, map, unmap,
and read_and_clear_dirty.
Further, adding new algorithms and techniques becomes easy to do across
the entire fleet of drivers and formats.
The C functions are drop in compatible with the existing iommu_domain_ops
using the IOMMU_PT_DOMAIN_OPS() macro. Each per-format implementation
compilation unit will produce exported symbols following the pattern
pt_iommu_FMT_map_pages() which the macro directly maps to the
iommu_domain_ops members. This avoids the additional function pointer
indirection like io-pgtable has.
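A hedged sketch of how a driver might use it (the expansion is inferred
from the naming pattern above; driver_attach_dev is hypothetical):
static const struct iommu_domain_ops driver_paging_ops = {
        IOMMU_PT_DOMAIN_OPS(amdv1),
        /* roughly the same as spelling out:
         *      .map_pages      = pt_iommu_amdv1_map_pages,
         *      .unmap_pages    = pt_iommu_amdv1_unmap_pages,
         *      .iova_to_phys   = pt_iommu_amdv1_iova_to_phys,
         */
        .attach_dev     = driver_attach_dev,    /* driver specific ops stay */
};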
The top level struct used by the drivers is pt_iommu_table_FMT. It
contains the other structs to allow container_of() to move between the
driver, iommu page table, generic page table, and generic format layers.
struct pt_iommu_table_amdv1 {
struct pt_iommu {
struct iommu_domain domain;
} iommu;
struct pt_amdv1 {
struct pt_common common;
} amdpt;
};
The driver is expected to union the pt_iommu_table_FMT with its own
existing domain struct:
struct driver_domain {
union {
struct iommu_domain domain;
struct pt_iommu_table_amdv1 amdv1;
};
};
PT_IOMMU_CHECK_DOMAIN(struct driver_domain, amdv1, domain);
The union creates an alias so that 'domain' does not need to be renamed in
a lot of driver code.
This allows all the layers to access all the necessary functions to
implement their different roles with no change to any of the existing
iommu core code.
Implement the basic starting point: pt_iommu_init(), get_info() and
deinit().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The generic API is intended to be separated from the implementation of
page table algorithms. It contains only accessors for walking and
manipulating the table and helpers that are useful for building an
implementation. Memory management is not in the generic API, but part of
the implementation.
Using a multi-compilation approach the implementation module would include
headers in this order:
common.h
defs_FMT.h
pt_defs.h
FMT.h
pt_common.h
IMPLEMENTATION.h
Where each compilation unit would have a combination of FMT and
IMPLEMENTATION to produce a per-format per-implementation module.
The API is designed so that the format headers have minimal logic, and
default implementations are provided if the format doesn't include one.
Generally formats provide their code via an inline function using the
pattern:
static inline FMTpt_XX(..) {}
#define pt_XX FMTpt_XX
The common code then enforces a function signature so that there is no
drift in function arguments, or accidental polymorphic functions (as has
been slightly troublesome in mm). Use of function-like #defines is
avoided in the format even though many of the functions are small enough.
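As a concrete (made up) instance of the pattern, a format header might
contain:
/* amdv1.h - illustrative only, the real rule differs */
static inline bool amdv1pt_can_have_leaf(const struct pt_state *pts)
{
        return pts->level <= 4;
}
#define pt_can_have_leaf amdv1pt_can_have_leaf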
Provide kdocs for the API surface.
This is enough to implement the 8 initial format variations with all of
their features:
* Entries comprised of contiguous blocks of IO PTEs for larger page
sizes (AMDv1, ARMv8)
* Multi-level tables, up to 6 levels. Runtime selected top level
* The size of the top table level can be selected at runtime (ARM's
concatenated tables)
* The number of levels in the table can optionally increase dynamically
during map (AMDv1)
* Optional leaf entries at any level
* 32 bit/64 bit virtual and output addresses, using every bit
* Sign extended addressing (x86)
* Dirty tracking
A basic simple format takes about 200 lines to declare the required inline
functions.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|