summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx.c
AgeCommit message (Collapse)Author
2015-05-06KVM: VMX: Preserve host CR4.MCE value while in guest mode.Ben Serebrin
commit 085e68eeafbf76e21848ad5bafaecec88a11dd64 upstream. The host's decision to enable machine check exceptions should remain in force during non-root mode. KVM was writing 0 to cr4 on VCPU reset and passed a slightly-modified 0 to the vmcs.guest_cr4 value. Tested: Built. On earlier version, tested by injecting machine check while a guest is spinning. Before the change, if guest CR4.MCE==0, then the machine check is escalated to Catastrophic Error (CATERR) and the machine dies. If guest CR4.MCE==1, then the machine check causes VMEXIT and is handled normally by host Linux. After the change, injecting a machine check causes normal Linux machine check handling. Signed-off-by: Ben Serebrin <serebrin@google.com> Reviewed-by: Venkatesh Srinivas <venkateshs@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-17KVM: nVMX: mask unrestricted_guest if disabled on L0Radim Krčmář
If EPT was enabled, unrestricted_guest was allowed in L1 regardless of L0. L1 triple faulted when running L2 guest that required emulation. Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)' in L0's dmesg: WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] () Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when the host doesn't have it enabled. Fixes: 78051e3b7e35 ("KVM: nVMX: Disable unrestricted mode if ept=0") Cc: stable@vger.kernel.org Tested-By: Kashyap Chamarthy <kchamart@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-13KVM: VMX: Set msr bitmap correctly if vcpu is in guest modeWincy Van
In commit 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap"), we are setting MSR_BITMAP in prepare_vmcs02 if we should use hardware. This is not enough since the field will be modified by following vmx_set_efer. Fix this by setting vmx_msr_bitmap_nested in vmx_set_msr_bitmap if vcpu is in guest mode. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-02-23KVM: VMX: fix build without CONFIG_SMPRadim Krčmář
'apic' is not defined if !CONFIG_X86_64 && !CONFIG_X86_LOCAL_APIC. Posted interrupt makes no sense without CONFIG_SMP, and CONFIG_X86_LOCAL_APIC will be set with it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-16Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf updates from Ingo Molnar: "This series tightens up RDPMC permissions: currently even highly sandboxed x86 execution environments (such as seccomp) have permission to execute RDPMC, which may leak various perf events / PMU state such as timing information and other CPU execution details. This 'all is allowed' RDPMC mode is still preserved as the (non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is that RDPMC access is only allowed if a perf event is mmap-ed (which is needed to correctly interpret RDPMC counter values in any case). As a side effect of these changes CR4 handling is cleaned up in the x86 code and a shadow copy of the CR4 value is added. The extra CR4 manipulation adds ~ <50ns to the context switch cost between rdpmc-capable and rdpmc-non-capable mms" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks perf/x86: Only allow rdpmc if a perf_event is mapped perf: Pass the event to arch_perf_update_userpage() perf: Add pmu callbacks to track event mapping and unmapping x86: Add a comment clarifying LDT context switching x86: Store a per-cpu shadow copy of CR4 x86: Clean up cr4 manipulation
2015-02-10KVM: x86: fix build with !CONFIG_SMPRadim Krčmář
<asm/apic.h> isn't included directly and without CONFIG_SMP, an option that automagically pulls it can't be enabled. Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04x86: Store a per-cpu shadow copy of CR4Andy Lutomirski
Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04x86: Clean up cr4 manipulationAndy Lutomirski
CR4 manipulation was split, seemingly at random, between direct (write_cr4) and using a helper (set/clear_in_cr4). Unfortunately, the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code, which only a small subset of users actually wanted. This patch replaces all cr4 access in functions that don't leave cr4 exactly the way they found it with new helpers cr4_set_bits, cr4_clear_bits, and cr4_set_bits_and_update_boot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-03KVM: nVMX: Enable nested posted interrupt processingWincy Van
If vcpu has a interrupt in vmx non-root mode, injecting that interrupt requires a vmexit. With posted interrupt processing, the vmexit is not needed, and interrupts are fully taken care of by hardware. In nested vmx, this feature avoids much more vmexits than non-nested vmx. When L1 asks L0 to deliver L1's posted interrupt vector, and the target VCPU is in non-root mode, we use a physical ipi to deliver POSTED_INTR_NV to the target vCPU. Using POSTED_INTR_NV avoids unexpected interrupts if a concurrent vmexit happens and L1's vector is different with L0's. The IPI triggers posted interrupt processing in the target physical CPU. In case the target vCPU was not in guest mode, complete the posted interrupt delivery on the next entry to L2. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03KVM: nVMX: Enable nested virtual interrupt deliveryWincy Van
With virtual interrupt delivery, the hardware lets KVM use a more efficient mechanism for interrupt injection. This is an important feature for nested VMX, because it reduces vmexits substantially and they are much more expensive with nested virtualization. This is especially important for throughput-bound scenarios. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03KVM: nVMX: Enable nested apic register virtualizationWincy Van
We can reduce apic register virtualization cost with this feature, it is also a requirement for virtual interrupt delivery and posted interrupt processing. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03KVM: nVMX: Make nested control MSRs per-cpuWincy Van
To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt, we should set posted intr bit in the nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we can not enable nested posted interrupt, the posted intr bit in the nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03KVM: nVMX: Enable nested virtualize x2apic modeWincy Van
When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-03KVM: nVMX: Prepare for using hardware MSR bitmapWincy Van
Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all of L2's msr access is intercepted by L0. Features like "virtualize x2apic mode" require that the MSR bitmap is enabled, or the hardware will exit and for example not virtualize the x2apic MSRs. In order to let L1 use these features, we need to build a merged bitmap that only not cause a VMEXIT if 1) L1 requires that 2) the bit is not required by the processor for APIC virtualization. For now the guests are still run with MSR bitmap disabled, but this patch already introduces nested_vmx_merge_msr_bitmap for future use. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-02KVM: x86: revert "add method to test PIR bitmap vector"Marcelo Tosatti
Revert 7c6a98dfa1ba9dc64a62e73624ecea9995736bbd, given that testing PIR is not necessary anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30kvm: vmx: fix oops with explicit flexpriority=0 optionPaolo Bonzini
A function pointer was not NULLed, causing kvm_vcpu_reload_apic_access_page to go down the wrong path and OOPS when doing put_page(NULL). This did not happen on old processors, only when setting the module option explicitly. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30KVM: VMX: Add PML support in VMXKai Huang
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added to allow user to enable/disable it manually. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-19x86: kvm: vmx: Remove some unused functionsRickard Strandqvist
Removes some functions that are not used anywhere: cpu_has_vmx_eptp_writeback() cpu_has_vmx_eptp_uncacheable() This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and ↵Paolo Bonzini
kvm_init_shadow_ept_mmu The initialization function in mmu.c can always use walk_mmu, which is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to initialize vcpu->arch.nested_mmu. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08KVM: x86: add method to test PIR bitmap vectorMarcelo Tosatti
kvm_x86_ops->test_posted_interrupt() returns true/false depending whether 'vector' is set. Next patch makes use of this interface. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicvTiejun Chen
In most cases calling hwapic_isr_update(), we always check if kvm_apic_vid_enabled() == 1, but actually, kvm_apic_vid_enabled() -> kvm_x86_ops->vm_has_apicv() -> vmx_vm_has_apicv() or '0' in svm case -> return enable_apicv && irqchip_in_kernel(kvm) So its a little cost to recall vmx_vm_has_apicv() inside hwapic_isr_update(), here just NULL out hwapic_isr_update() in case of !enable_apicv inside hardware_setup() then make all related stuffs follow this. Note we don't check this under that condition of irqchip_in_kernel() since we should make sure definitely any caller don't work without in-kernel irqchip. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08KVM: nVMX: consult PFEC_MASK and PFEC_MATCH when generating #PF VM-exitEugene Korenevsky
When generating #PF VM-exit, check equality: (PFEC & PFEC_MASK) == PFEC_MATCH If there is equality, the 14 bit of exception bitmap is used to take decision about generating #PF VM-exit. If there is inequality, inverted 14 bit is used. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08KVM: nVMX: Improve nested msr switch checkingEugene Korenevsky
This patch improve checks required by Intel Software Developer Manual. - SMM MSRs are not allowed. - microcode MSRs are not allowed. - check x2apic MSRs only when LAPIC is in x2apic mode. - MSR switch areas must be aligned to 16 bytes. - address of first and last byte in MSR switch areas should not set any bits beyond the processor's physical-address width. Also it adds warning messages on failures during MSR switch. These messages are useful for people who debug their VMMs in nVMX. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08KVM: nVMX: Add nested msr load/restore algorithmWincy Van
Several hypervisors need MSR auto load/restore feature. We read MSRs from VM-entry MSR load area which specified by L1, and load them via kvm_set_msr in the nested entry. When nested exit occurs, we get MSRs via kvm_get_msr, writing them to L1`s MSR store area. After this, we read MSRs from VM-exit MSR load area, and load them via kvm_set_msr. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-27kvm: x86: vmx: reorder some msr writingTiejun Chen
The commit 34a1cd60d17f, "x86: vmx: move some vmx setting from vmx_init() to hardware_setup()", tried to refactor some codes specific to vmx hardware setting into hardware_setup(), but some msr writing should depend on our previous setting condition like enable_apicv, enable_ept and so on. Reported-by: Jamie Heilman <jamie@audible.transient.net> Tested-by: Jamie Heilman <jamie@audible.transient.net> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-11KVM: nVMX: Disable unrestricted mode if ept=0Bandan Das
If L0 has disabled EPT, don't advertise unrestricted mode at all since it depends on EPT to run real mode code. Fixes: 92fbc7b195b824e201d9f06f2b93105f72384d65 Cc: stable@vger.kernel.org Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05kvm: vmx: add nested virtualization support for xsavesWanpeng Li
Add nested virtualization support for xsaves. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05kvm: vmx: add MSR logic for XSAVESWanpeng Li
Add logic to get/set the XSS model-specific register. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05kvm: x86: handle XSAVES vmcs and vmexitWanpeng Li
Initialize the XSS exit bitmap. It is zero so there should be no XSAVES or XRSTORS exits. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05kvm: x86: Add kvm_x86_ops hook that enables XSAVES for guestWanpeng Li
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is available. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-18kvm: x86: vmx: cleanup handle_ept_violationTiejun Chen
Instead, just use PFERR_{FETCH, PRESENT, WRITE}_MASK inside handle_ept_violation() for slightly better code. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-12x86, kvm, vmx: Don't set LOAD_IA32_EFER when host and guest matchAndy Lutomirski
There's nothing to switch if the host and guest values are the same. I am unable to find evidence that this makes any difference whatsoever. Signed-off-by: Andy Lutomirski <luto@amacapital.net> [I could see a difference on Nehalem. From 5 runs: userspace exit, guest!=host 12200 11772 12130 12164 12327 userspace exit, guest=host 11983 11780 11920 11919 12040 lightweight exit, guest!=host 3214 3220 3238 3218 3337 lightweight exit, guest=host 3178 3193 3193 3187 3220 This passes the t-test with 99% confidence for userspace exit, 98.5% confidence for lightweight exit. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-12x86, kvm, vmx: Always use LOAD_IA32_EFER if availableAndy Lutomirski
At least on Sandy Bridge, letting the CPU switch IA32_EFER is much faster than switching it manually. I benchmarked this using the vmexit kvm-unit-test (single run, but GOAL multiplied by 5 to do more iterations): Test Before After Change cpuid 2000 1932 -3.40% vmcall 1914 1817 -5.07% mov_from_cr8 13 13 0.00% mov_to_cr8 19 19 0.00% inl_from_pmtimer 19164 10619 -44.59% inl_from_qemu 15662 10302 -34.22% inl_from_kernel 3916 3802 -2.91% outl_to_kernel 2230 2194 -1.61% mov_dr 172 176 2.33% ipi (skipped) (skipped) ipi+halt (skipped) (skipped) ple-round-robin 13 13 0.00% wr_tsc_adjust_msr 1920 1845 -3.91% rd_tsc_adjust_msr 1892 1814 -4.12% mmio-no-eventfd:pci-mem 16394 11165 -31.90% mmio-wildcard-eventfd:pci-mem 4607 4645 0.82% mmio-datamatch-eventfd:pci-mem 4601 4610 0.20% portio-no-eventfd:pci-io 11507 7942 -30.98% portio-wildcard-eventfd:pci-io 2239 2225 -0.63% portio-datamatch-eventfd:pci-io 2250 2234 -0.71% I haven't explicitly computed the significance of these numbers, but this isn't subtle. Signed-off-by: Andy Lutomirski <luto@amacapital.net> [The results were reproducible on all of Nehalem, Sandy Bridge and Ivy Bridge. The slowness of manual switching is because writing to EFER with WRMSR triggers a TLB flush, even if the only bit you're touching is SCE (so the page table format is not affected). Doing the write as part of vmentry/vmexit, instead, does not flush the TLB, probably because all processors that have EPT also have VPID. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Breakpoints do not consider CS.baseNadav Amit
x86 debug registers hold a linear address. Therefore, breakpoints detection should consider CS.base, and check whether instruction linear address equals (CS.base + RIP). This patch introduces a function to evaluate RIP linear address and uses it for breakpoints detection. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Clear DR6[0:3] on #DB during handle_drNadav Amit
DR6[0:3] (previous breakpoint indications) are cleared when #DB is injected during handle_exception, just as real hardware does. Similarily, handle_dr should clear DR6[0:3]. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: reset RVI upon system resetWei Wang
A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm, sometimes the guests run into blue screen during reboot. The problem was that a guest's RVI was not cleared when it rebooted. This patch has fixed the problem. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com> Tested-by: Rongrong Liu <rongrongx.liu@intel.com>, Da Chun <ngugc@qq.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: avoid returning bool to distinguish success from errorPaolo Bonzini
Return a negative error code instead, and WARN() when we should be covering the entire 2-bit space of vmcs_field_type's return value. For increased robustness, add a BUILD_BUG_ON checking the range of vmcs_field_to_offset. Suggested-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()Tiejun Chen
Instead of vmx_init(), actually it would make reasonable sense to do anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup(). Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: move down hardware_setup() and hardware_unsetup()Tiejun Chen
Just move this pair of functions down to make sure later we can add something dependent on others. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: vmx: Unavailable DR4/5 is checked before CPLNadav Amit
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD should be generated even if CPL>0. This is according to Intel SDM Table 6-2: "Priority Among Simultaneous Exceptions and Interrupts". Note, that this may happen on the first DR access, even if the host does not sets debug breakpoints. Obviously, it occurs when the host debugs the guest. This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr. The emulator already checks DR4/5 availability in check_dr_read. Nested virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject exceptions to the guest. As for SVM, the patch follows the previous logic as much as possible. Anyhow, it appears the DR interception code might be buggy - even if the DR access may cause an exception, the instruction is skipped. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: Clear DR7.LE during task-switchNadav Amit
DR7.LE should be cleared during task-switch. This feature is poorly documented. For reference, see: http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm SDM [17.2.4]: This feature is not supported in the P6 family processors, later IA-32 processors, and Intel 64 processors. AMD [2:13.1.1.4]: This bit is ignored by implementations of the AMD64 architecture. Intel's formulation could mean that it isn't even zeroed, but current hardware indeed does not behave like that. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: DR7.GD should be cleared upon any #DB exceptionNadav Amit
Intel SDM 17.2.4 (Debug Control Register (DR7)) says: "The processor clears the GD flag upon entering to the debug exception handler." This sentence may be misunderstood as if it happens only on #DB due to debug-register protection, but it happens regardless to the cause of the #DB. Fix the behavior to match both real hardware and Bochs. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03x86,kvm,vmx: Don't trap writes to CR4.TSDAndy Lutomirski
CR4.TSD is guest-owned; don't trap writes to it in VMX guests. This avoids a VM exit on context switches into or out of a PR_TSC_SIGSEGV task. I think that this fixes an unintentional side-effect of: 4c38609ac569 KVM: VMX: Make guest cr4 mask more conservative Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-02KVM: vmx: defer load of APIC access page address during resetPaolo Bonzini
Most call paths to vmx_vcpu_reset do not hold the SRCU lock. Defer loading the APIC access page to the next vmentry. This avoids the following lockdep splat: [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted ------------------------------- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (&vcpu->mutex){+.+...}, at: [<ffffffffa037d800>] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0000000000000001 ffff880209983ca8 ffffffff816f514f 0000000000000000 ffff8802099b8990 ffff880209983cd8 ffffffff810bd687 00000000000fee00 ffff880208a2c000 ffff880208a10000 ffff88020ef50040 ffff880209983d08 Call Trace: [<ffffffff816f514f>] dump_stack+0x4e/0x71 [<ffffffff810bd687>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffffa037d055>] gfn_to_memslot+0xd5/0xe0 [kvm] [<ffffffffa03807d3>] __gfn_to_pfn+0x33/0x60 [kvm] [<ffffffffa0380885>] gfn_to_page+0x25/0x90 [kvm] [<ffffffffa038aeec>] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [<ffffffffa08f0a9c>] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [<ffffffffa039ab8e>] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [<ffffffffa039ac0c>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [<ffffffffa037f7e0>] kvm_vm_ioctl+0x1d0/0x780 [kvm] [<ffffffff810bc664>] ? __lock_is_held+0x54/0x80 [<ffffffff812231f0>] do_vfs_ioctl+0x300/0x520 [<ffffffff8122ee45>] ? __fget+0x5/0x250 [<ffffffff8122f0fa>] ? __fget_light+0x2a/0xe0 [<ffffffff81223491>] SyS_ioctl+0x81/0xa0 [<ffffffff816fed6d>] system_call_fastpath+0x16/0x1b Reported-by: Takashi Iwai <tiwai@suse.de> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com> Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com> Fixes: 38b9917350cb2946e368ba684cfc33d1672f104e Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-02KVM: nVMX: Disable preemption while reading from shadow VMCSJan Kiszka
In order to access the shadow VMCS, we need to load it. At this point, vmx->loaded_vmcs->vmcs and the actually loaded one start to differ. If we now get preempted by Linux, vmx_vcpu_put and, on return, the vmx_vcpu_load will work against the wrong vmcs. That can cause copy_shadow_to_vmcs12 to corrupt the vmcs12 state. Fix the issue by disabling preemption during the copy operation. copy_vmcs12_to_shadow is safe from this issue as it is executed by vmx_vcpu_run when preemption is already disabled before vmentry. This bug is exposed by running Jailhouse within KVM on CPUs with shadow VMCS support. Jailhouse never expects an interrupt pending vmexit, but the bug can cause it if, after copy_shadow_to_vmcs12 is preempted, the active VMCS happens to have the virtual interrupt pending flag set in the CPU-based execution controls. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24kvm: x86: don't kill guest on unknown exit reasonMichael S. Tsirkin
KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was triggered by a priveledged application. Let's not kill the guest: WARN and inject #UD instead. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24kvm: vmx: handle invvpid vm exit gracefullyPetr Matousek
On systems with invvpid instruction support (corresponding bit in IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid causes vm exit, which is currently not handled and results in propagation of unknown exit to userspace. Fix this by installing an invvpid vm exit handler. This is CVE-2014-3646. Cc: stable@vger.kernel.org Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24KVM: x86: Prevent host from panicking on shared MSR writes.Andy Honig
The previous patch blocked invalid writes directly when the MSR is written. As a precaution, prevent future similar mistakes by gracefulling handle GPs caused by writes to shared MSRs. Cc: stable@vger.kernel.org Signed-off-by: Andrew Honig <ahonig@google.com> [Remove parts obsoleted by Nadav's patch. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24KVM: x86: Check non-canonical addresses upon WRMSRNadav Amit
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if non-canonical address is written on Intel but not on AMD (which ignores the top 32-bits). Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architecutres, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to canonical value before writing instead of injecting a #GP. Some references from Intel and AMD manuals: According to Intel SDM description of WRMSR instruction #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP." According to AMD manual instruction manual: LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs." IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur." IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form." This patch fixes CVE-2014-3610. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-18x86,kvm,vmx: Preserve CR4 across VM entryAndy Lutomirski
CR4 isn't constant; at least the TSD and PCE bits can vary. TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks like it's correct. This adds a branch and a read from cr4 to each vm entry. Because it is extremely likely that consecutive entries into the same vcpu will have the same host cr4 value, this fixes up the vmcs instead of restoring cr4 after the fact. A subsequent patch will add a kernel-wide cr4 shadow, reducing the overhead in the common case to just two memory reads and a branch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Cc: Petr Matousek <pmatouse@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>