linux-toradex.git/virt, branch v4.4.93

vfio: New external user group/file match

2017-07-27T22:06:07+00:00

commit 5d6dee80a1e94cc284d03e06d930e60e8d3ecf7d upstream.

At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen.  The group can be in a state where we wouldn't allow a
new reference to be added.  This new helper function allows a caller
to match a file to a group to facilitate this.  Given a file and
group, report if they match.  Thus the caller needs to already have a
group reference to match to the file.  This allows the deletion of a
group without acquiring a new reference.

Signed-off-by: Alex Williamson 
Reviewed-by: Eric Auger 
Reviewed-by: Paolo Bonzini 
Tested-by: Eric Auger 
Signed-off-by: Greg Kroah-Hartman

KVM: kvm_io_bus_unregister_dev() should never fail

2017-04-08T07:53:32+00:00

commit 90db10434b163e46da413d34db8d0e77404cc645 upstream.

No caller currently checks the return value of
kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on
freeing their device. A stale reference will remain in the io_bus,
getting at least used again, when the iobus gets teared down on
kvm_destroy_vm() - leading to use after free errors.

There is nothing the callers could do, except retrying over and over
again.

So let's simply remove the bus altogether, print an error and make
sure no one can access this broken bus again (returning -ENOMEM on any
attempt to access it).

Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU")
Reported-by: Dmitry Vyukov 
Reviewed-by: Cornelia Huck 
Signed-off-by: David Hildenbrand 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: x86: clear bus pointer when destroyed

2017-04-08T07:53:31+00:00

commit df630b8c1e851b5e265dc2ca9c87222e342c093b upstream.

When releasing the bus, let's clear the bus pointers to mark it out. If
any further device unregister happens on this bus, we know that we're
done if we found the bus being released already.

Signed-off-by: Peter Xu 
Signed-off-by: Radim Krčmář 
Signed-off-by: Greg Kroah-Hartman

KVM: eventfd: fix NULL deref irqbypass consumer

2017-01-19T19:17:19+00:00

commit 4f3dbdf47e150016aacd734e663347fcaa768303 upstream.

Reported syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
    PGD 0

    Oops: 0002 [#1] SMP
    CPU: 1 PID: 125 Comm: kworker/1:1 Not tainted 4.9.0+ #1
    Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
    task: ffff9bbe0dfbb900 task.stack: ffffb61802014000
    RIP: 0010:irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
    Call Trace:
     irqfd_shutdown+0x66/0xa0 [kvm]
     process_one_work+0x16b/0x480
     worker_thread+0x4b/0x500
     kthread+0x101/0x140
     ? process_one_work+0x480/0x480
     ? kthread_create_on_node+0x60/0x60
     ret_from_fork+0x25/0x30
    RIP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass] RSP: ffffb61802017e20
    CR2: 0000000000000008

The syzkaller folks reported a NULL pointer dereference that due to
unregister an consumer which fails registration before. The syzkaller
creates two VMs w/ an equal eventfd occasionally. So the second VM
fails to register an irqbypass consumer. It will make irqfd as inactive
and queue an workqueue work to shutdown irqfd and unregister the irqbypass
consumer when eventfd is closed. However, the second consumer has been
initialized though it fails registration. So the token(same as the first
VM's) is taken to unregister the consumer through the workqueue, the
consumer of the first VM is found and unregistered, then NULL deref incurred
in the path of deleting consumer from the consumers list.

This patch fixes it by making irq_bypass_register/unregister_consumer()
looks for the consumer entry based on consumer pointer itself instead of
token matching.

Reported-by: Dmitry Vyukov 
Suggested-by: Alex Williamson 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Dmitry Vyukov 
Cc: Alex Williamson 
Signed-off-by: Wanpeng Li 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: nVMX: Fix memory corruption when using VMCS shadowing

2016-08-20T16:09:18+00:00

commit 2f1fe81123f59271bddda673b60116bde9660385 upstream.

When freeing the nested resources of a vcpu, there is an assumption that
the vcpu's vmcs01 is the current VMCS on the CPU that executes
nested_release_vmcs12(). If this assumption is violated, the vcpu's
vmcs01 may be made active on multiple CPUs at the same time, in
violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
not VMCLEARed on every CPU on which it is active, it can linger in a
CPU's VMCS cache after it has been freed and potentially
repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
miss can result in memory corruption.

It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
the vcpu in question was last loaded on a different CPU, it must be
migrated to the current CPU before calling vmx_load_vmcs01().

Signed-off-by: Jim Mattson 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES

2016-07-27T16:47:31+00:00

commit caf1ff26e1aa178133df68ac3d40815fed2187d9 upstream.

These days, we experienced one guest crash with 8 cores and 3 disks,
with qemu error logs as bellow:

qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
kvm_irqchip_commit_routes: Assertion `ret == 0' failed.

And then we found one patch(bdf026317d) in qemu tree, which said
could fix this bug.

Execute the following script will reproduce the BUG quickly:

irq_affinity.sh
========================================================================

vda_irq_num=25
vdb_irq_num=27
while [ 1 ]
do
    for irq in {1,2,4,8,10,20,40,80}
        do
            echo $irq > /proc/irq/$vda_irq_num/smp_affinity
            echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
            dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
            dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
        done
done
========================================================================

The following qemu log is added in the qemu code and is displayed when
this bug reproduced:

kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
irq_routes->nr: 1024, gsi_count: 1024.

That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
but in the kernel code when routes->nr >= 1024, will just return -EINVAL;

The nr is the number of the routing entries which is in of
[1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].

This patch fix the BUG above.

Signed-off-by: Xiubo Li 
Signed-off-by: Wei Tang 
Signed-off-by: Zhang Zhuoyu 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

KVM: irqfd: fix NULL pointer dereference in kvm_irq_map_gsi

2016-06-24T17:18:18+00:00

commit c622a3c21ede892e370b56e1ceb9eb28f8bbda6b upstream.

Found by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
    IP: [] kvm_irq_map_gsi+0x12/0x90 [kvm]
    PGD 6f80b067 PUD b6535067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 4988 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
    [...]
    Call Trace:
     [] irqfd_update+0x32/0xc0 [kvm]
     [] kvm_irqfd+0x3dc/0x5b0 [kvm]
     [] kvm_vm_ioctl+0x164/0x6f0 [kvm]
     [] do_vfs_ioctl+0x298/0x480
     [] SyS_ioctl+0x79/0x90
     [] tracesys_phase2+0x84/0x89
    Code: b5 71 a7 e0 5b 41 5c 41 5d 5d f3 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 8f 10 2e 00 00 31 c0 48 89 e5 <39> 91 20 01 00 00 76 6a 48 63 d2 48 8b 94 d1 28 01 00 00 48 85
    RIP  [] kvm_irq_map_gsi+0x12/0x90 [kvm]
     RSP 
    CR2: 0000000000000120

Testcase:

    #include 
    #include 
    #include 
    #include 
    #include 
    #include 
    #include 

    long r[26];

    int main()
    {
        memset(r, -1, sizeof(r));
        r[2] = open("/dev/kvm", 0);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);

        struct kvm_irqfd ifd;
        ifd.fd = syscall(SYS_eventfd2, 5, 0);
        ifd.gsi = 3;
        ifd.flags = 2;
        ifd.resamplefd = ifd.fd;
        r[25] = ioctl(r[3], KVM_IRQFD, &ifd);
        return 0;
    }

Reported-by: Dmitry Vyukov 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Radim Krčmář 
Signed-off-by: Greg Kroah-Hartman

KVM: arm/arm64: Handle forward time correction gracefully

2016-05-04T21:48:40+00:00

commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45 upstream.

On a host that runs NTP, corrections can have a direct impact on
the background timer that we program on the behalf of a vcpu.

In particular, NTP performing a forward correction will result in
a timer expiring sooner than expected from a guest point of view.
Not a big deal, we kick the vcpu anyway.

But on wake-up, the vcpu thread is going to perform a check to
find out whether or not it should block. And at that point, the
timer check is going to say "timer has not expired yet, go back
to sleep". This results in the timer event being lost forever.

There are multiple ways to handle this. One would be record that
the timer has expired and let kvm_cpu_has_pending_timer return
true in that case, but that would be fairly invasive. Another is
to check for the "short sleep" condition in the hrtimer callback,
and restart the timer for the remaining time when the condition
is detected.

This patch implements the latter, with a bit of refactoring in
order to avoid too much code duplication.

Reported-by: Alexander Graf 
Reviewed-by: Alexander Graf 
Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
Signed-off-by: Greg Kroah-Hartman

KVM: fix spin_lock_init order on x86

2016-04-12T16:08:34+00:00

commit e9ad4ec8379ad1ba6f68b8ca1c26b50b5ae0a327 upstream.

Moving the initialization earlier is needed in 4.6 because
kvm_arch_init_vm is now using mmu_lock, causing lockdep to
complain:

[  284.440294] INFO: trying to register non-static key.
[  284.445259] the code is fine but needs lockdep annotation.
[  284.450736] turning off the locking correctness validator.
...
[  284.528318]  [] lock_acquire+0xd3/0x240
[  284.533733]  [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[  284.541467]  [] _raw_spin_lock+0x41/0x80
[  284.546960]  [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[  284.554707]  [] kvm_page_track_register_notifier+0x20/0x60 [kvm]
[  284.562281]  [] kvm_mmu_init_vm+0x20/0x30 [kvm]
[  284.568381]  [] kvm_arch_init_vm+0x1ea/0x200 [kvm]
[  284.574740]  [] kvm_dev_ioctl+0xbf/0x4d0 [kvm]

However, it also helps fixing a preexisting problem, which is why this
patch is also good for stable kernels: kvm_create_vm was incrementing
current->mm->mm_count but not decrementing it at the out_err label (in
case kvm_init_mmu_notifier failed).  The new initialization order makes
it possible to add the required mmdrop without adding a new error label.

Reported-by: Borislav Petkov 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

kvm: cap halt polling at exactly halt_poll_ns

2016-03-16T15:42:58+00:00

commit 313f636d5c490c9741d3f750dc8da33029edbc6b upstream.

When growing halt-polling, there is no check that the poll time exceeds
the limit. It's possible for vcpu->halt_poll_ns grow once past
halt_poll_ns, and stay there until a halt which takes longer than
vcpu->halt_poll_ns. For example, booting a Linux guest with
halt_poll_ns=11000:

 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)

Signed-off-by: David Matlack 
Fixes: aca6ff29c4063a8d467cdee241e6b3bf7dc4a171
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman