linux-toradex.git/arch/x86, branch v4.4.7

perf/x86/intel: Fix PEBS data source interpretation on Nehalem/Westmere

2016-04-12T16:09:06+00:00

commit e17dc65328057c00db7e1bfea249c8771a78b30b upstream.

Jiri reported some time ago that some entries in the PEBS data source table
in perf do not agree with the SDM. We investigated and the bits
changed for Sandy Bridge, but the SDM was not updated.

perf already implements the bits correctly for Sandy Bridge
and later. This patch patches it up for Nehalem and Westmere.

Signed-off-by: Andi Kleen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1456871124-15985-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf/x86/intel: Use PAGE_SIZE for PEBS buffer size on Core2

2016-04-12T16:09:06+00:00

commit e72daf3f4d764c47fb71c9bdc7f9c54a503825b1 upstream.

Using PAGE_SIZE buffers makes the WRMSR to PERF_GLOBAL_CTRL in
intel_pmu_enable_all() mysteriously hang on Core2. As a workaround, we
don't do this.

The hard lockup is easily triggered by running 'perf test attr'
repeatedly. Most of the time it gets stuck on sample session with
small periods.

  # perf test attr -vv
  14: struct perf_event_attr setup                             :
  --- start ---
  ...
    'PERF_TEST_ATTR=/tmp/tmpuEKz3B /usr/bin/perf record -o /tmp/tmpuEKz3B/perf.data -c 123 kill >/dev/null 2>&1' ret 1

Reported-by: Arnaldo Carvalho de Melo 
Signed-off-by: Jiri Olsa 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Andi Kleen 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Kan Liang 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wang Nan 
Link: http://lkml.kernel.org/r/20160301190352.GA8355@krava.redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf/x86/intel: Fix PEBS warning by only restoring active PMU in pmi

2016-04-12T16:09:06+00:00

commit c3d266c8a9838cc141b69548bc3b1b18808ae8c4 upstream.

This patch tries to fix a PEBS warning found in my stress test. The
following perf command can easily trigger the pebs warning or spurious
NMI error on Skylake/Broadwell/Haswell platforms:

  sudo perf record -e 'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,cache-references' --call-graph fp -b -c1000 -a

Also the NMI watchdog must be enabled.

For this case, the events number is larger than counter number. So
perf has to do multiplexing.

In perf_mux_hrtimer_handler, it does perf_pmu_disable(), schedule out
old events, rotate_ctx, schedule in new events and finally
perf_pmu_enable().

If the old events include precise event, the MSR_IA32_PEBS_ENABLE
should be cleared when perf_pmu_disable().  The MSR_IA32_PEBS_ENABLE
should keep 0 until the perf_pmu_enable() is called and the new event is
precise event.

However, there is a corner case which could restore PEBS_ENABLE to
stale value during the above period. In perf_pmu_disable(), GLOBAL_CTRL
will be set to 0 to stop overflow and followed PMI. But there may be
pending PMI from an earlier overflow, which cannot be stopped. So even
GLOBAL_CTRL is cleared, the kernel still be possible to get PMI. At
the end of the PMI handler, __intel_pmu_enable_all() will be called,
which will restore the stale values if old events haven't scheduled
out.

Once the stale pebs value is set, it's impossible to be corrected if
the new events are non-precise. Because the pebs_enabled will be set
to 0. x86_pmu.enable_all() will ignore the MSR_IA32_PEBS_ENABLE
setting. As a result, the following NMI with stale PEBS_ENABLE
trigger pebs warning.

The pending PMI after enabled=0 will become harmless if the NMI handler
does not change the state. This patch checks cpuc->enabled in pmi and
only restore the state when PMU is active.

Here is the dump:

  Call Trace:
     [] dump_stack+0x63/0x85
   [] warn_slowpath_common+0x82/0xc0
   [] warn_slowpath_null+0x1a/0x20
   [] intel_pmu_drain_pebs_nhm+0x2be/0x320
   [] intel_pmu_handle_irq+0x279/0x460
   [] ? native_write_msr_safe+0x6/0x40
   [] ? vunmap_page_range+0x20d/0x330
   [] ?  unmap_kernel_range_noflush+0x11/0x20
   [] ? ghes_copy_tofrom_phys+0x10f/0x2a0
   [] ? ghes_read_estatus+0x98/0x170
   [] perf_event_nmi_handler+0x2d/0x50
   [] nmi_handle+0x69/0x120
   [] default_do_nmi+0xe6/0x100
   [] do_nmi+0xe2/0x130
   [] end_repeat_nmi+0x1a/0x1e
   [] ? native_write_msr_safe+0x6/0x40
   [] ? native_write_msr_safe+0x6/0x40
   [] ? native_write_msr_safe+0x6/0x40
   <>    [] ?  x86_perf_event_set_period+0xd8/0x180
   [] x86_pmu_start+0x4c/0x100
   [] x86_pmu_enable+0x28d/0x300
   [] perf_pmu_enable.part.81+0x7/0x10
   [] perf_mux_hrtimer_handler+0x200/0x280
   [] ?  __perf_install_in_context+0xc0/0xc0
   [] __hrtimer_run_queues+0xfd/0x280
   [] hrtimer_interrupt+0xa8/0x190
   [] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
   [] local_apic_timer_interrupt+0x38/0x60
   [] smp_apic_timer_interrupt+0x3d/0x50
   [] apic_timer_interrupt+0x8c/0xa0
     [] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
   [] ?  smp_call_function_single+0xd5/0x130
   [] ?  smp_call_function_single+0xcb/0x130
   [] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
   [] event_function_call+0x10a/0x120
   [] ? ctx_resched+0x90/0x90
   [] ? cpu_clock_event_read+0x30/0x30
   [] ? _perf_event_disable+0x60/0x60
   [] _perf_event_enable+0x5b/0x70
   [] perf_event_for_each_child+0x38/0xa0
   [] ? _perf_event_disable+0x60/0x60
   [] perf_ioctl+0x12d/0x3c0
   [] ? selinux_file_ioctl+0x95/0x1e0
   [] do_vfs_ioctl+0xa1/0x5a0
   [] ? sched_clock+0x9/0x10
   [] SyS_ioctl+0x79/0x90
   [] entry_SYSCALL_64_fastpath+0x1a/0xa4
  ---[ end trace aef202839fe9a71d ]---
  Uhhuh. NMI received for unknown reason 2d on CPU 2.
  Do you have a strange power saving mode enabled?

Signed-off-by: Kan Liang 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/r/1457046448-6184-1-git-send-email-kan.liang@intel.com
[ Fixed various typos and other small details. ]
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf/x86/pebs: Add workaround for broken OVFL status on HSW+

2016-04-12T16:09:05+00:00

commit 8077eca079a212f26419c57226f28696b7100683 upstream.

This patch fixes an issue with the GLOBAL_OVERFLOW_STATUS bits on
Haswell, Broadwell and Skylake processors when using PEBS.

The SDM stipulates that when the PEBS iterrupt threshold is crossed,
an interrupt is posted and the kernel is interrupted. The kernel will
find GLOBAL_OVF_SATUS bit 62 set indicating there are PEBS records to
drain. But the bits corresponding to the actual counters should NOT be
set. The kernel follows the SDM and assumes that all PEBS events are
processed in the drain_pebs() callback. The kernel then checks for
remaining overflows on any other (non-PEBS) events and processes these
in the for_each_bit_set(&status) loop.

As it turns out, under certain conditions on HSW and later processors,
on PEBS buffer interrupt, bit 62 is set but the counter bits may be
set as well. In that case, the kernel drains PEBS and generates
SAMPLES with the EXACT tag, then it processes the counter bits, and
generates normal (non-EXACT) SAMPLES.

I ran into this problem by trying to understand why on HSW sampling on
a PEBS event was sometimes returning SAMPLES without the EXACT tag.
This should not happen on user level code because HSW has the
eventing_ip which always point to the instruction that caused the
event.

The workaround in this patch simply ensures that the bits for the
counters used for PEBS events are cleared after the PEBS buffer has
been drained. With this fix 100% of the PEBS samples on my user code
report the EXACT tag.

Before:
  $ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
  $ perf report -D | fgrep SAMPLES
  PERF_RECORD_SAMPLE(IP, 0x2): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
                           \--- EXACT tag is missing

After:
  $ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
  $ perf report -D | fgrep SAMPLES
  PERF_RECORD_SAMPLE(IP, 0x4002): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
                           \--- EXACT tag is set

The problem tends to appear more often when multiple PEBS events are used.

Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/1457034642-21837-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

x86/mm: TLB_REMOTE_SEND_IPI should count pages

2016-04-12T16:08:38+00:00

commit 18c98243ddf05a1827ad2c359c5ac051101e7ff7 upstream.

TLB_REMOTE_SEND_IPI was recently introduced, but it counts bytes instead
of pages.  In addition, it does not report correctly the case in which
flush_tlb_page flushes a page.  Fix it to be consistent with other TLB
counters.

Fixes: 5b74283ab251b9d ("x86, mm: trace when an IPI is about to be sent")
Signed-off-by: Nadav Amit 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

x86/iopl: Fix iopl capability check on Xen PV

2016-04-12T16:08:38+00:00

commit c29016cf41fe9fa994a5ecca607cf5f1cd98801e upstream.

iopl(3) is supposed to work if iopl is already 3, even if
unprivileged.  This didn't work right on Xen PV.  Fix it.

Reviewewd-by: Jan Beulich 
Signed-off-by: Andy Lutomirski 
Cc: Andrew Cooper 
Cc: Andy Lutomirski 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: David Vrabel 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Jan Beulich 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/8ce12013e6e4c0a44a97e316be4a6faff31bd5ea.1458162709.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

x86/iopl/64: Properly context-switch IOPL on Xen PV

2016-04-12T16:08:38+00:00

commit b7a584598aea7ca73140cb87b40319944dd3393f upstream.

On Xen PV, regs->flags doesn't reliably reflect IOPL and the
exit-to-userspace code doesn't change IOPL.  We need to context
switch it manually.

I'm doing this without going through paravirt because this is
specific to Xen PV.  After the dust settles, we can merge this with
the 32-bit code, tidy up the iopl syscall implementation, and remove
the set_iopl pvop entirely.

Fixes XSA-171.

Reviewewd-by: Jan Beulich 
Signed-off-by: Andy Lutomirski 
Cc: Andrew Cooper 
Cc: Andy Lutomirski 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: David Vrabel 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Jan Beulich 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()

2016-04-12T16:08:38+00:00

commit 7834c10313fb823e538f2772be78edcdeed2e6e3 upstream.

Since 4.4, I've been able to trigger this occasionally:

===============================
[ INFO: suspicious RCU usage. ]
4.5.0-rc7-think+ #3 Not tainted
Cc: Andi Kleen 
Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 

-------------------------------
./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/3/0.

stack backtrace:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
 ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
 ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
 ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
Call Trace:
   [] dump_stack+0x67/0x9d
 [] lockdep_rcu_suspicious+0xe6/0x100
 [] do_trace_write_msr+0x127/0x1a0
 [] native_apic_msr_eoi_write+0x23/0x30
 [] smp_trace_call_function_interrupt+0x38/0x360
 [] trace_call_function_interrupt+0x90/0xa0
   [] ? cpuidle_enter_state+0x1b4/0x520

Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
tells the RCU susbstems to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly.

Suggested-by: Andi Kleen 
Fixes: 4787c368a9bc "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
Signed-off-by: Dave Jones 
Signed-off-by: Thomas Gleixner

x86/irq: Cure live lock in fixup_irqs()

2016-04-12T16:08:37+00:00

commit 551adc60573cb68e3d55cacca9ba1b7437313df7 upstream.

Harry reported, that he's able to trigger a system freeze with cpu hot
unplug. The freeze turned out to be a live lock caused by recent changes in
irq_force_complete_move().

When fixup_irqs() and from there irq_force_complete_move() is called on the
dying cpu, then all other cpus are in stop machine an wait for the dying cpu
to complete the teardown. If there is a move of an interrupt pending then
irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
mask and waits for them to clear the mask. That's obviously impossible as
those cpus are firmly stuck in stop machine with interrupts disabled.

I should have known that, but I completely overlooked it being concentrated on
the locking issues around the vectors. And the existance of the call to
__irq_complete_move() in the code, which actually sends the cleanup IPI made
it reasonable to wait for that cleanup to complete. That call was bogus even
before the recent changes as it was just a pointless distraction.

We have to look at two cases:

1) The move_in_progress flag of the interrupt is set

   This means the ioapic has been updated with the new vector, but it has not
   fired yet. In theory there is a race:

   set_ioapic(new_vector) <-- Interrupt is raised before update is effective,
   			      i.e. it's raised on the old vector.

   So if the target cpu cannot handle that interrupt before the old vector is
   cleaned up, we get a spurious interrupt and in the worst case the ioapic
   irq line becomes stale, but my experiments so far have only resulted in
   spurious interrupts.

   But in case of cpu hotplug this should be a non issue because if the
   affinity update happens right before all cpus rendevouz in stop machine,
   there is no way that the interrupt can be blocked on the target cpu because
   all cpus loops first with interrupts enabled in stop machine, so the old
   vector is not yet cleaned up when the interrupt fires.

   So the only way to run into this issue is if the delivery of the interrupt
   on the apic/system bus would be delayed beyond the point where the target
   cpu disables interrupts in stop machine. I doubt that it can happen, but at
   least there is a theroretical chance. Virtualization might be able to
   expose this, but AFAICT the IOAPIC emulation is not as stupid as the real
   hardware.

   I've spent quite some time over the weekend to enforce that situation,
   though I was not able to trigger the delayed case.

2) The move_in_progress flag is not set and the old_domain cpu mask is not
   empty.

   That means, that an interrupt was delivered after the change and the
   cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
   responded to it yet.

In both cases we can assume that the next interrupt will arrive on the new
vector, so we can cleanup the old vectors on the cpus in the old_domain cpu
mask.

Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
Reported-by: Harry Junior 
Tested-by: Tony Luck 
Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Joe Lawrence 
Cc: Borislav Petkov 
Cc: Ben Hutchings 
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanos
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

KVM: VMX: fix nested vpid for old KVM guests

2016-04-12T16:08:34+00:00

commit ef697a712a6165aea7779c295604b099e8bfae2e upstream.

Old KVM guests invoke single-context invvpid without actually checking
whether it is supported.  This was fixed by commit 518c8ae ("KVM: VMX:
Make sure single type invvpid is supported before issuing invvpid
instruction", 2010-08-01) and the patch after, but pre-2.6.36
kernels lack it including RHEL 6.

Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Fixes: 99b83ac893b84ed1a62ad6d1f2b6cc32026b9e85
Reviewed-by: David Matlack 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman