linux-toradex.git/virt/kvm, branch v6.11-rc3

KVM: guest_memfd: abstract how prepared folios are recorded

2024-07-26T18:46:15+00:00

Right now, large folios are not supported in guest_memfd, and therefore the order
used by kvm_gmem_populate() is always 0.  In this scenario, using the up-to-date
bit to track prepared-ness is nice and easy because we have one bit available
per page.

In the future, however, we might have large pages that are partially populated;
for example, in the case of SEV-SNP, if a large page has both shared and private
areas inside, it is necessary to populate it at a granularity that is smaller
than that of the guest_memfd's backing store.  In that case we will have
to track preparedness at a 4K level, probably as a bitmap.

In preparation for that, do not use explicitly folio_test_uptodate() and
folio_mark_uptodate().  Return the state of the page directly from
__kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages
with N=*max_order.  The function to mark a range as prepared for now
takes just a folio, but is expected to take also an index and order
(or something like that) when large pages are introduced.

Thanks to Michael Roth for pointing out the issue with large pages.

Signed-off-by: Paolo Bonzini

KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns

2024-07-26T18:46:15+00:00

This check is currently performed by sev_gmem_post_populate(), but it
applies to all callers of kvm_gmem_populate(): the point of the function
is that the memory is being encrypted and some work has to be done
on all the gfns in order to encrypt them.

Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
to invoking the callback, and stop the operation if a shared page
is encountered.  Because CONFIG_KVM_PRIVATE_MEM in principle does
not require attributes, this makes kvm_gmem_populate() depend on
CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: extend kvm_range_has_memory_attributes() to check subset of attributes

2024-07-26T18:46:15+00:00

While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
KVM code such as kvm_mem_is_private() is written to expect their existence.
Allow using kvm_range_has_memory_attributes() as a multi-page version of
kvm_mem_is_private(), without it breaking later when more attributes are
introduced.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()

2024-07-26T18:46:14+00:00

Use a guard to simplify early returns, and add two more easy
shortcuts.  If the requested attributes are invalid, the attributes
xarray will never show them as set.  And if testing a single page,
kvm_get_memory_attributes() is more efficient.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: guest_memfd: move check for already-populated page to common code

2024-07-26T18:46:14+00:00

Do not allow populating the same page twice with startup data.  In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.

Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.

Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing technology,
it doesn't matter to guest_memfd.c whether this is something fishy
such as missing synchronization in userspace, or rather something
intentional.  One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().

Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: remove kvm_arch_gmem_prepare_needed()

2024-07-26T18:46:14+00:00

It is enough to return 0 if a guest need not do any preparation.
This is in fact how sev_gmem_prepare() works for non-SNP guests,
and it extends naturally to Intel hosts: the x86 callback for
gmem_prepare is optional and returns 0 if not defined.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm

2024-07-26T18:46:14+00:00

This is now possible because preparation is done by kvm_gmem_get_pfn()
instead of fallocate().  In practice this is not a limitation, because
even though guest_memfd can be bound to multiple struct kvm, for
hardware implementations of confidential computing only one guest
(identified by an ASID on SEV-SNP, or an HKID on TDX) will be able
to access it.

In the case of intra-host migration (not implemented yet for SEV-SNP,
but we can use SEV-ES as an idea of how it will work), the new struct
kvm inherits the same ASID and preparation need not be repeated.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest

2024-07-26T18:46:14+00:00

Initializing the contents of the folio on fallocate() is unnecessarily
restrictive.  It means that the page is registered with the firmware and
then it cannot be touched anymore.  In particular, this loses the
possibility of using fallocate() to pre-allocate the page for SEV-SNP
guests, because kvm_arch_gmem_prepare() then fails.

It's only when the guest actually accesses the page (and therefore
kvm_gmem_get_pfn() is called) that the page must be cleared from any
stale host data and registered with the firmware.  The up-to-date flag
is clear if this has to be done (i.e. it is the first access and
kvm_gmem_populate() has not been called).

All in all, there are enough differences between kvm_gmem_get_pfn() and
kvm_gmem_populate(), that it's better to separate the two flows completely.
Extract the bulk of kvm_gmem_get_folio(), which take a folio and end up
setting its up-to-date flag, to a new function kvm_gmem_prepare_folio();
these are now done only by the non-__-prefixed kvm_gmem_get_pfn().
As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument.

One difference is that fallocate(PUNCH_HOLE) can now race with a
page fault.  Potentially this causes a page to be prepared and into the
filemap even after fallocate(PUNCH_HOLE).  This is harmless, as it can be
fixed by another hole punching operation, and can be avoided by clearing
the private-page attribute prior to invoking fallocate(PUNCH_HOLE).
This way, the page fault will cause an exit to user space.

The previous semantics, where fallocate() could be used to prepare
the pages in advance of running the guest, can be accessed with
KVM_PRE_FAULT_MEMORY.

For now, accessing a page in one VM will attempt to call
kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd.
Cleaning this up is left to a separate patch.

Suggested-by: Sean Christopherson 
Signed-off-by: Paolo Bonzini

KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfn

2024-07-26T18:46:14+00:00

Allow testing the up-to-date flag in the caller without taking the
lock again.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini

KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*

2024-07-26T18:46:14+00:00

Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
the arch-independent step to clear out contents left in the page by the
host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.

Reviewed-by: Michael Roth 
Signed-off-by: Paolo Bonzini