Age | Commit message (Collapse) | Author |
|
[ Upstream commit 28a68387888997e8a7fa57940ea5d55f2e16b594 ]
If the IOMMU group setup fails, the reset module is not released.
Fixes: b5add544d677d363 ("vfio, platform: make reset driver a requirement by default")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0e714d27786ce1fb3efa9aac58abc096e68b1c2a upstream.
info.index can be indirectly controlled by user-space, hence leading
to a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl()
warn: potential spectre issue 'vdev->region'
Fix this by sanitizing info.index before indirectly using it to index
vdev->region
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cf0d53ba4947aad6e471491d5b20a567cbe92e56 upstream.
MRRS defines the maximum read request size a device is allowed to
make. Drivers will often increase this to allow more data transfer
with a single request. Completions to this request are bound by the
MPS setting for the bus. Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology. Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3393af24b665cb0aea7353b05e522b03ab1e7d73 ]
This adds missing checking for kzalloc() return value.
Fixes: 4b6fad7097f8 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1282ba7fc28dbc66c3f0e4aaafaaa228361d1ae5 ]
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
picks the v2.
Normally the userspace would create a container, attach an IOMMU group
to it and only then set the IOMMU type (which would normally be v2).
However a specific IOMMU group may not support v2, in other words
it may not implement set_window/unset_window/take_ownership/
release_ownership and such a group should not be attached to
a v2 container.
This adds extra checks that a new group can do what the selected IOMMU
type suggests. The userspace can then test the return value from
ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
VFIO_SPAPR_TCE_IOMMU.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 523184972b282cd9ca17a76f6ca4742394856818 ]
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 2da64d20a0b20046d688e44f4033efd09157e29d ]
Commit d9c728949ddc ("vfio/spapr: Postpone default window creation")
added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and
made it possible to return from tce_iommu_ioctl() without unlocking
container->lock; this fixes the issue.
Fixes: d9c728949ddc ("vfio/spapr: Postpone default window creation")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e19f32da5ded958238eac1bbe001192acef191a2 ]
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 45e869714489431625c569d21fc952428d761476 ]
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:
ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!
The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5d6dee80a1e94cc284d03e06d930e60e8d3ecf7d upstream.
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen. The group can be in a state where we wouldn't allow a
new reference to be added. This new helper function allows a caller
to match a file to a group to facilitate this. Given a file and
group, report if they match. Thus the caller needs to already have a
group reference to match to the file. This allows the deletion of a
group without acquiring a new reference.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 811642d8d8a82c0cce8dc2debfdaf23c5a144839 upstream.
If vfio_iommu_group_notifier() acquires a group reference and that
reference becomes the last reference to the group, then vfio_group_put
introduces a deadlock code path where we're trying to unregister from
the iommu notifier chain from within a callout of that chain. Use a
work_struct to release this reference asynchronously.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bd00fdf198e2da475a2f4265a83686ab42d998a8 ]
The recently added mediated VFIO driver doesn't know about powerpc iommu.
It thus doesn't register a struct iommu_table_group in the iommu group
upon device creation. The iommu_data pointer hence remains null.
This causes a kernel oops when userspace tries to set the iommu type of a
container associated with a mediated device to VFIO_SPAPR_TCE_v2_IOMMU.
[ 82.585440] mtty mtty: MDEV: Registered
[ 87.655522] iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 10
[ 87.655527] vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 10
[ 116.297184] Unable to handle kernel paging request for data at address 0x00000030
[ 116.297389] Faulting instruction address: 0xd000000007870524
[ 116.297465] Oops: Kernel access of bad area, sig: 11 [#1]
[ 116.297611] SMP NR_CPUS=2048
[ 116.297611] NUMA
[ 116.297627] PowerNV
...
[ 116.297954] CPU: 33 PID: 7067 Comm: qemu-system-ppc Not tainted 4.10.0-rc5-mdev-test #8
[ 116.297993] task: c000000e7718b680 task.stack: c000000e77214000
[ 116.298025] NIP: d000000007870524 LR: d000000007870518 CTR: 0000000000000000
[ 116.298064] REGS: c000000e77217990 TRAP: 0300 Not tainted (4.10.0-rc5-mdev-test)
[ 116.298103] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 116.298107] CR: 84004444 XER: 00000000
[ 116.298154] CFAR: c00000000000888c DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1
GPR00: d000000007870518 c000000e77217c10 d00000000787b0ed c000000eed2103c0
GPR04: 0000000000000000 0000000000000000 c000000eed2103e0 0000000f24320000
GPR08: 0000000000000104 0000000000000001 0000000000000000 d0000000078729b0
GPR12: c00000000025b7e0 c00000000fe08400 0000000000000001 000001002d31d100
GPR16: 000001002c22c850 00003ffff315c750 0000000043145680 0000000043141bc0
GPR20: ffffffffffffffed fffffffffffff000 0000000020003b65 d000000007706018
GPR24: c000000f16cf0d98 d000000007706000 c000000003f42980 c000000003f42980
GPR28: c000000f1575ac00 c000000003f429c8 0000000000000000 c000000eed2103c0
[ 116.298504] NIP [d000000007870524] tce_iommu_attach_group+0x10c/0x360 [vfio_iommu_spapr_tce]
[ 116.298555] LR [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce]
[ 116.298601] Call Trace:
[ 116.298610] [c000000e77217c10] [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce] (unreliable)
[ 116.298671] [c000000e77217cb0] [d0000000077033a0] vfio_fops_unl_ioctl+0x278/0x3e0 [vfio]
[ 116.298713] [c000000e77217d40] [c0000000002a3ebc] do_vfs_ioctl+0xcc/0x8b0
[ 116.298745] [c000000e77217de0] [c0000000002a4700] SyS_ioctl+0x60/0xc0
[ 116.298782] [c000000e77217e30] [c00000000000b220] system_call+0x38/0xfc
[ 116.298812] Instruction dump:
[ 116.298828] 7d3f4b78 409effc8 3d220000 e9298020 3c800140 38a00018 608480c0 e8690028
[ 116.298869] 4800249d e8410018 7c7f1b79 41820230 <e93e0030> 2fa90000 419e0114 e9090020
[ 116.298914] ---[ end trace 1e10b0ced08b9120 ]---
This patch fixes the oops.
Reported-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 930a42ded3fede7ca3acafc9153f4f2d0f56a92c ]
If a container already has a group attached, attaching a new group
should just program already created IOMMU tables to the hardware via
the iommu_table_group_ops::set_window() callback.
However commit 6f01cc692a16 ("vfio/spapr: Add a helper to create
default DMA window") did not just simplify the code but also removed
the set_window() calls in the case of attaching groups to a container
which already has tables so it broke VFIO PCI hotplug.
This reverts set_window() bits in tce_iommu_take_ownership_ddw().
Fixes: 6f01cc692a16 ("vfio/spapr: Add a helper to create default DMA window")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0cfef2b7410b64d7a430947e0b533314c4f97153 upstream.
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.
vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d9c728949ddc9de5734bf3b12ea906ca8a77f2a0 ]
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.
At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.
This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6f01cc692a16405235d5c34056455b182682123c ]
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.
Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.
This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4b6fad7097f883335b6d9627c883cb7f276d94c9 ]
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().
This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.
This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.
This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.
As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bc82d122ae4a0e9f971f13403995898fcfa0c09e ]
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).
This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.
This replaces current->mm with container->mm everywhere except debug
prints.
This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.
DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.
This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d7baee6901b34c4895eb78efdbf13a49079d7404 ]
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 39701e56f5f16ea0cf8fc9e8472e645f8de91d23 ]
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.
This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().
Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Simplify the interrupt setup by using the new PCI layer helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.
This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
|
|
We get a few warnings when building kernel with W=1:
drivers/vfio/platform/vfio_platform_common.c:76:5: warning: no previous prototype for 'vfio_platform_acpi_call_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:98:6: warning: no previous prototype for 'vfio_platform_acpi_has_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:640:5: warning: no previous prototype for 'vfio_platform_of_probe' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_amdxgbe.c:59:5: warning: no previous prototype for 'vfio_platform_amdxgbe_reset' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c:60:5: warning: no previous prototype for 'vfio_platform_calxedaxgmac_reset' [-Wmissing-prototypes]
....
In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type. The other data types assume the data pointer contains
one element of their type as well.
In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users. Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely. For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
|
|
Release call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will cause stack dump with
WARN_ON and warn the user of failure.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Open call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will abort open if reset fails.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The code was allowing platform devices to be used without a supporting
VFIO reset driver. The hardware can be left in some inconsistent state
after a guest machine abort.
The reset driver will put the hardware back to safe state and disable
interrupts before returning the control back to the host machine.
Adding a new reset_required kernel module option to platform VFIO drivers.
The default value is true for the DT and ACPI based drivers.
The reset requirement value for AMBA drivers is set to false and is
unchangeable to maintain the existing functionality.
New requirements are:
1. A reset function needs to be implemented by the corresponding driver
via DT/ACPI.
2. The reset function needs to be discovered via DT/ACPI.
The probe of the driver will fail if any of the above conditions are
not satisfied.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The device tree code checks for the presence of a reset driver and calls
the of_reset function pointer by looking up the reset driver as a module.
ACPI defines _RST method to perform device level reset. After the _RST
method is executed, the OS can resume using the device. _RST method is
expected to stop DMA transfers and IRQs.
This patch introduces two functions as vfio_platform_acpi_has_reset and
vfio_platform_acpi_call_reset. The has reset method is used to declare
reset capability via the ioctl flag VFIO_DEVICE_FLAGS_RESET. The call
reset function is used to execute the _RST ACPI method.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Getting ready to bring out extra debug information to the caller
so that more verbose information can be printed when an error is
observed.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The code is using the compatible DT string to associate a reset driver
with the actual device itself. The compatible string does not exist on
ACPI based systems. HID is the unique identifier for a device driver
instead.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Creating a new function to determine if this driver supports reset
function or not. This is an attempt to abstract device tree calls
from the rest of the code.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The reset call sequence seems to replicate itself multiple times
across the file. Grouping them together for maintenance reasons.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Renaming the reset function to of_reset as it is only used
by the device tree based platforms.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The vfio group should be released after
the vfio_group_try_dissolve_container call.
The code should not rely on someone else to hold
a reference on the group.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs. This will cause some
performance issues when we passthrough a PCI device with
this kind of BARs. Guest will be not able to handle the mmio
accesses to the BARs which leads to mmio emulations in host.
However, not all sub-page BARs will share page with other BARs.
We should allow to mmap the sub-page MMIO BARs which we can
make sure will not share page with other BARs.
This patch adds support for this case. And we try to add a
dummy resource to reserve the remainder of the page which
hot-add device's BAR might be assigned into. But it's not
necessary to handle the case when the BAR is not page aligned.
Because we can't expect the BAR will be assigned into the same
location in a page in guest when we passthrough the BAR. And
it's hard to access this BAR in userspace because we have
no way to get the BAR's location in a page.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The vfio No-IOMMU mode was supported by this
'commit 03a76b60f8ba2797 ("vfio: Include No-IOMMU mode")',
but it only support vfio-pci.
Using vfio_iommu_group_get/put, but not iommu_group_get/put,
the platform devices can be exposed to userspace with
CONFIG_VFIO_NOIOMMU and the "enable_unsafe_noiommu_mode"
option enabled.
From 'commit 03a76b60f8ba2797 ("vfio: Include No-IOMMU mode")',
"This should make it very clear that this mode is not safe.
Additionally, CAP_SYS_RAWIO privileges are necessary to work
with groups and containers using this mode. Groups making
use of this support are named /dev/vfio/noiommu-$GROUP and
can only make use of the special VFIO_NOIOMMU_IOMMU for the
container. Use of this mode, specifically binding a device
without a native IOMMU group to a VFIO bus driver will taint
the kernel and should therefore not be considered supported."
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Eric Auger <eric.auger@linaro.org>
Cc: Baptiste Reynal <b.reynal@virtualopensystems.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes. Zero our buffer and
accept anything other than an error. Intel X710 NICs exercise this.
Fixes: 4e1a635552d3 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
This function cannot actually be called with npage = 0, so in practice
this doesn't return an uninitialized value.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the
trigger eventfd before calling vfio_virqfd_disable() any potential
mask and unmask eventfds. This opens a use-after-free race where an
inopportune irqfd can reference the freed signalling eventfd. Reorder
to avoid this possibility.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Pull VFIO updates from Alex Williamson:
- Hide INTx on certain known broken devices (Alex Williamson)
- Additional backdoor reset detection (Alex Williamson)
- Remove unused iommudata reference (Alexey Kardashevskiy)
- Use cfg_size to avoid probing extended config space (Alexey
Kardashevskiy)
* tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio:
vfio_pci: Test for extended capabilities if config space > 256 bytes
vfio_iommu_spapr_tce: Remove unneeded iommu_group_get_iommudata
vfio/pci: Add test for BAR restore
vfio/pci: Hide broken INTx support from user
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
|
|
PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.
However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.
VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.
This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Many IOMMUs support multiple page table formats, meaning that any given
domain may only support a subset of the hardware page sizes presented in
iommu_ops->pgsize_bitmap. There are also certain use-cases where the
creator of a domain may want to control which page sizes are used, for
example to force the use of hugepage mappings to reduce pagetable walk
depth.
To this end, add a per-domain pgsize_bitmap to represent the subset of
page sizes actually in use, to make it possible for domains with
different requirements to coexist.
Signed-off-by: Will Deacon <will.deacon@arm.com>
[rm: hijacked and rebased original patch with new commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This removes iommu_group_get_iommudata() as the result is never used.
As this is a minor cleanup, no change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming. Add an additional test to check the
BARs on each write to the command register.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx. This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported. The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register. With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.
Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration. This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.
One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion. In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source. For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support. By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.
Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way. We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
|