linux-toradex.git/Documentation/virtual, branch tegra

KVM: Device assignment permission checks

2012-01-18T15:31:54+00:00

(cherry picked from commit 3d27e23b17010c668db311140b17bbbb70c78fb9)

Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.

[Yang Bai: add missing #include]
[avi: fix comment style]

Signed-off-by: Alex Williamson 
Signed-off-by: Yang Bai 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Greg Kroah-Hartman

KVM: Remove ability to assign a device without iommu support

2012-01-18T15:31:53+00:00

(cherry picked from commit 423873736b78f549fbfa2f715f2e4de7e6c5e1e9)

This option has no users and it exposes a security hole that we
can allow devices to be assigned without iommu protection.  Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.

Signed-off-by: Alex Williamson 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Greg Kroah-Hartman

lguest: allow booting guest with CONFIG_RELOCATABLE=y

2011-08-15T00:45:10+00:00

The CONFIG_RELOCATABLE code tries to align the unpack destination to
the value of 'kernel_alignment' in the setup_hdr.  If that's 0, it
tries to unpack to address 0, which in fact causes the gunzip code
to call 'error("Out of memory while allocating output buffer")'.

The bootloader (ie. the lguest Launcher in this case) should be doing
setting this field; the normal bzImage is 16M, we can use the same.

Reported-by: Stefanos Geraggelos 
Signed-off-by: Rusty Russell 
Cc: stable@kernel.org

virtio: Add text copy of spec to Documentation/virtual.

2011-08-15T00:45:10+00:00

As suggested by Christoph Hellwig.

Signed-off-by: Rusty Russell

Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2011-07-24T16:07:03+00:00

* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  KVM: IOMMU: Disable device assignment without interrupt remapping
  KVM: MMU: trace mmio page fault
  KVM: MMU: mmio page fault support
  KVM: MMU: reorganize struct kvm_shadow_walk_iterator
  KVM: MMU: lockless walking shadow page table
  KVM: MMU: do not need atomicly to set/clear spte
  KVM: MMU: introduce the rules to modify shadow page table
  KVM: MMU: abstract some functions to handle fault pfn
  KVM: MMU: filter out the mmio pfn from the fault pfn
  KVM: MMU: remove bypass_guest_pf
  KVM: MMU: split kvm_mmu_free_page
  KVM: MMU: count used shadow pages on prepareing path
  KVM: MMU: rename 'pt_write' to 'emulate'
  KVM: MMU: cleanup for FNAME(fetch)
  KVM: MMU: optimize to handle dirty bit
  KVM: MMU: cache mmio info on page fault path
  KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
  KVM: MMU: do not update slot bitmap if spte is nonpresent
  KVM: MMU: fix walking shadow page table
  KVM guest: KVM Steal time registration
  ...

lguest: update comments

2011-07-22T05:09:50+00:00

Also removes a long-unused #define and an extraneous semicolon.

Signed-off-by: Rusty Russell

lguest: Simplify device initialization.

2011-07-22T05:09:49+00:00

We used to notify the Host every time we updated a device's status.  However,
it only really needs to know when we're resetting the device, or failed to
initialize it, or when we've finished our feature negotiation.

In particular, we used to wait for VIRTIO_CONFIG_S_DRIVER_OK in the
status byte before starting the device service threads.  But this
corresponds to the successful finish of device initialization, which
might (like virtio_blk's partition scanning) use the device.  So we
had a hack, if they used the device before we expected we started the
threads anyway.

Now we hook into the finalize_features hook in the Guest: at that
point we tell the Launcher that it can rely on the features we have
acked.  On the Launcher side, we look at the status at that point, and
start servicing the device.

Signed-off-by: Rusty Russell

lguest: Do not exit on non-fatal errors

2011-07-22T05:09:48+00:00

Do not exit on some non-fatal errors:

- writev() fails in net_output(). The result is a lost packet or packets.
- writev() fails in console_output(). The result is partially lost console
output.
- readv() fails in net_input(). The result is a lost packet or packets.

Rather than bringing the guest down, this patch ignores e.g. an allocation
failure on the host side. Example:

lguest: page allocation failure. order:4, mode:0x4d0
Pid: 4045, comm: lguest Tainted: G        W   2.6.36 #1
Call Trace:
 [] ? printk+0x18/0x1c
 [] __alloc_pages_nodemask+0x4d2/0x570
 [] cache_alloc_refill+0x2a4/0x4d0
 [] ? __netif_receive_skb+0x189/0x270
 [] __kmalloc+0xda/0xf0
 [] __alloc_skb+0x55/0x100
 [] ? net_rx_action+0x79/0x100
 [] sock_alloc_send_pskb+0x18d/0x280
 [] ? _copy_from_user+0x35/0x130
 [] ? memcpy_fromiovecend+0x56/0x80
 [] tun_chr_aio_write+0x1cc/0x500
 [] do_sync_readv_writev+0x95/0xd0
 [] ? _copy_from_user+0x35/0x130
 [] ? rw_copy_check_uvector+0x58/0x100
 [] do_readv_writev+0x9c/0x1d0
 [] ? tun_chr_aio_write+0x0/0x500
 [] vfs_writev+0x4a/0x60
 [] sys_writev+0x41/0x80
 [] syscall_call+0x7/0xb
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
HighMem per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:134651 inactive_anon:50543 isolated_anon:0
 active_file:96881 inactive_file:132007 isolated_file:0
 unevictable:0 dirty:3 writeback:0 unstable:0
 free:91374 slab_reclaimable:6300 slab_unreclaimable:2802
 mapped:2281 shmem:9 pagetables:330 bounce:0
DMA free:3524kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:8kB active_file:8760kB inactive_file:2760kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15868kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:0kB slab_reclaimable:88kB slab_unreclaimable:148kB kernel_stack:40kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 865 2016 2016
Normal free:150100kB min:3728kB low:4660kB high:5592kB active_anon:6224kB inactive_anon:15772kB active_file:324084kB inactive_file:325944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:12kB writeback:0kB mapped:1520kB shmem:0kB slab_reclaimable:25112kB slab_unreclaimable:11060kB kernel_stack:1888kB pagetables:1320kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 9207 9207
HighMem free:211872kB min:512kB low:1752kB high:2992kB active_anon:532380kB inactive_anon:186392kB active_file:54680kB inactive_file:199324kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1178504kB mlocked:0kB dirty:0kB writeback:0kB mapped:7588kB shmem:36kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 3*4kB 65*8kB 35*16kB 18*32kB 11*64kB 9*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3524kB
Normal: 35981*4kB 344*8kB 158*16kB 28*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 150100kB
HighMem: 5732*4kB 5462*8kB 2826*16kB 1598*32kB 84*64kB 10*128kB 7*256kB 1*512kB 1*1024kB 1*2048kB 9*4096kB = 211872kB
231237 total pagecache pages
2340 pages in swap cache
Swap cache stats: add 160060, delete 157720, find 189017/194106
Free swap  = 4179840kB
Total swap = 4194300kB
524271 pages RAM
296946 pages HighMem
5668 pages reserved
867664 pages shared
82155 pages non-shared

Signed-off-by: Sakari Ailus 
Signed-off-by: Rusty Russell

KVM: KVM Steal time guest/host interface

2011-07-12T10:17:03+00:00

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who wants to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa 
Acked-by: Rik van Riel 
Tested-by: Eric B Munson 
CC: Jeremy Fitzhardinge 
CC: Peter Zijlstra 
CC: Anthony Liguori 
Signed-off-by: Avi Kivity

KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests

2011-07-12T10:16:57+00:00

This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility.  These processors require a physically
contiguous, aligned area of memory for each guest.  When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access.  The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.

Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator.  The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.

KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs.  The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.

This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA.  It
also returns the size of the RMA in the argument structure.

Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace.  To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory.  Subsequently we will get rid of this
array and use memory associated with each memslot instead.

This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region.  Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB.  However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.

Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest.  This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexander Graf