linux-toradex.git/arch, branch v3.0.74

x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal

2013-04-17T04:02:31+00:00

commit 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 upstream.

Invoking arch_flush_lazy_mmu_mode() results in calls to
preempt_enable()/disable() which may have performance impact.

Since lazy MMU is not used on bare metal we can patch away
arch_flush_lazy_mmu_mode() so that it is never called in such
environment.

[ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU
  updates" may cause a minor performance regression on
  bare metal.  This patch resolves that performance regression.  It is
  somewhat unclear to me if this is a good -stable candidate. ]

Signed-off-by: Boris Ostrovsky 
Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com
Tested-by: Josh Boyer 
Tested-by: Konrad Rzeszutek Wilk 
Acked-by: Borislav Petkov 
Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: H. Peter Anvin 
Signed-off-by: Greg Kroah-Hartman

x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates

2013-04-17T04:02:31+00:00

commit 1160c2779b826c6f5c08e5cc542de58fd1f667d5 upstream.

In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
when lazy MMU updates are enabled, because set_pgd effects are being
deferred.

One instance of this problem is during process mm cleanup with memory
cgroups enabled. The chain of events is as follows:

- zap_pte_range enables lazy MMU updates
- zap_pte_range eventually calls mem_cgroup_charge_statistics,
  which accesses the vmalloc'd mem_cgroup per-cpu stat area
- vmalloc_fault is triggered which tries to sync the corresponding
  PGD entry with set_pgd, but the update is deferred
- vmalloc_fault oopses due to a mismatch in the PUD entries

The OOPs usually looks as so:

------------[ cut here ]------------
kernel BUG at arch/x86/mm/fault.c:396!
invalid opcode: 0000 [#1] SMP
.. snip ..
CPU 1
Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1
RIP: e030:[]  [] vmalloc_fault+0x11f/0x208
.. snip ..
Call Trace:
 [] do_page_fault+0x399/0x4b0
 [] ? xen_mc_extend_args+0xec/0x110
 [] page_fault+0x25/0x30
 [] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
 [] __mem_cgroup_uncharge_common+0xd8/0x350
 [] mem_cgroup_uncharge_page+0x57/0x60
 [] page_remove_rmap+0xe0/0x150
 [] ? vm_normal_page+0x1a/0x80
 [] unmap_single_vma+0x531/0x870
 [] unmap_vmas+0x52/0xa0
 [] ? pte_mfn_to_pfn+0x72/0x100
 [] exit_mmap+0x98/0x170
 [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [] mmput+0x83/0xf0
 [] exit_mm+0x104/0x130
 [] do_exit+0x15a/0x8c0
 [] do_group_exit+0x3f/0xa0
 [] sys_exit_group+0x17/0x20
 [] system_call_fastpath+0x16/0x1b

Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
changes visible to the consistency checks.

RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737
Tested-by: Josh Boyer 
Reported-and-Tested-by: Krishna Raman 
Signed-off-by: Samu Kallio 
Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com
Tested-by: Konrad Rzeszutek Wilk 
Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: H. Peter Anvin 
Signed-off-by: Greg Kroah-Hartman

x86-32, mm: Rip out x86_32 NUMA remapping code

2013-04-17T04:02:30+00:00

commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream.

[was already included in 3.0, but I missed the patch hunk for 
 arch/x86/mm/numa_32.c  - gregkh]

This code was an optimization for 32-bit NUMA systems.

It has probably been the cause of a number of subtle bugs over
the years, although the conditions to excite them would have
been hard to trigger.  Essentially, we remap part of the kernel
linear mapping area, and then sometimes part of that area gets
freed back in to the bootmem allocator.  If those pages get
used by kernel data structures (say mem_map[] or a dentry),
there's no big deal.  But, if anyone ever tried to use the
linear mapping for these pages _and_ cared about their physical
address, bad things happen.

For instance, say you passed __GFP_ZERO to the page allocator
and then happened to get handed one of these pages, it zero the
remapped page, but it would make a pte to the _old_ page.
There are probably a hundred other ways that it could screw
with things.

We don't need to hang on to performance optimizations for
these old boxes any more.  All my 32-bit NUMA systems are long
dead and buried, and I probably had access to more than most
people.

This code is causing real things to break today:

	https://lkml.org/lkml/2013/1/9/376

I looked in to actually fixing this, but it requires surgery
to way too much brittle code, as well as stuff like
per_cpu_ptr_to_phys().

[ hpa: Cc: this for -stable, since it is a memory corruption issue.
  However, an alternative is to simply mark NUMA as depends BROKEN
  rather than EXPERIMENTAL in the X86_32 subclause... ]

Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin 
Cc: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

x86-32, mm: Rip out x86_32 NUMA remapping code

2013-04-12T16:18:10+00:00

commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream.

This code was an optimization for 32-bit NUMA systems.

It has probably been the cause of a number of subtle bugs over
the years, although the conditions to excite them would have
been hard to trigger.  Essentially, we remap part of the kernel
linear mapping area, and then sometimes part of that area gets
freed back in to the bootmem allocator.  If those pages get
used by kernel data structures (say mem_map[] or a dentry),
there's no big deal.  But, if anyone ever tried to use the
linear mapping for these pages _and_ cared about their physical
address, bad things happen.

For instance, say you passed __GFP_ZERO to the page allocator
and then happened to get handed one of these pages, it zero the
remapped page, but it would make a pte to the _old_ page.
There are probably a hundred other ways that it could screw
with things.

We don't need to hang on to performance optimizations for
these old boxes any more.  All my 32-bit NUMA systems are long
dead and buried, and I probably had access to more than most
people.

This code is causing real things to break today:

	https://lkml.org/lkml/2013/1/9/376

I looked in to actually fixing this, but it requires surgery
to way too much brittle code, as well as stuff like
per_cpu_ptr_to_phys().

[ hpa: Cc: this for -stable, since it is a memory corruption issue.
  However, an alternative is to simply mark NUMA as depends BROKEN
  rather than EXPERIMENTAL in the X86_32 subclause... ]

Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin 
Cc: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

powerpc: pSeries_lpar_hpte_remove fails from Adjunct partition being performed before the ANDCOND test

2013-04-12T16:18:09+00:00

commit 9fb2640159f9d4f5a2a9d60e490482d4cbecafdb upstream.

Some versions of pHyp will perform the adjunct partition test before the
ANDCOND test.  The result of this is that H_RESOURCE can be returned and
cause the BUG_ON condition to occur. The HPTE is not removed.  So add a
check for H_RESOURCE, it is ok if this HPTE is not removed as
pSeries_lpar_hpte_remove is looking for an HPTE to remove and not a
specific HPTE to remove.  So it is ok to just move on to the next slot
and try again.

Signed-off-by: Michael Wolf 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Greg Kroah-Hartman

alpha: Add irongate_io to PCI bus resources

2013-04-12T16:18:09+00:00

commit aa8b4be3ac049c8b1df2a87e4d1d902ccfc1f7a9 upstream.

Fixes a NULL pointer dereference at boot on UP1500.

Reviewed-and-Tested-by: Matt Turner 
Signed-off-by: Jay Estabrook 
Signed-off-by: Matt Turner 
Signed-off-by: Michael Cree 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

iommu/amd: Make sure dma_ops are set for hotplug devices

2013-04-05T17:16:54+00:00

commit c2a2876e863356b092967ea62bebdb4dd663af80 upstream.

There is a bug introduced with commit 27c2127 that causes
devices which are hot unplugged and then hot-replugged to
not have per-device dma_ops set. This causes these devices
to not function correctly. Fixed with this patch.

Reported-by: Andreas Degert 
Signed-off-by: Joerg Roedel 
Signed-off-by: Greg Kroah-Hartman

KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461)

2013-04-05T17:16:51+00:00

commit 6d1068b3a98519247d8ba4ec85cd40ac136dbdf9 upstream.

On hosts without the XSAVE support unprivileged local user can trigger
oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest
cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN
ioctl.

invalid opcode: 0000 [#2] SMP
Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables
...
Pid: 24935, comm: zoog_kvm_monito Tainted: G      D      3.2.0-3-686-pae
EIP: 0060:[] EFLAGS: 00210246 CPU: 0
EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm]
EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0
task.ti=d7c62000)
Stack:
 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000
 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0
 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80
Call Trace:
 [] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm]
...
 [] ? syscall_call+0x7/0xb
Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74
1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01
d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89
EIP: [] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP
0068:d7c63e70

QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID
and then sets them later. So guest's X86_FEATURE_XSAVE should be masked
out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with
X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with
X86_FEATURE_XSAVE even on hosts that do not support it, might be
susceptible to this attack from inside the guest as well.

Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support.

Signed-off-by: Petr Matousek 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

KVM: Ensure all vcpus are consistent with in-kernel irqchip settings

2013-04-05T17:16:38+00:00

commit 3e515705a1f46beb1c942bb8043c16f8ac7b1e9e upstream.

If some vcpus are created before KVM_CREATE_IRQCHIP, then
irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
to potential NULL pointer dereferences.

Fix by:
- ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
- ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP

This is somewhat long winded because vcpu->arch.apic is created without
kvm->lock held.

Based on earlier patch by Michael Ellerman.

Signed-off-by: Michael Ellerman 
Signed-off-by: Avi Kivity 
Signed-off-by: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

KVM: x86: Prevent starting PIT timers in the absence of irqchip support

2013-04-05T17:16:38+00:00

commit 0924ab2cfa98b1ece26c033d696651fd62896c69 upstream.

User space may create the PIT and forgets about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
 [] pit_do_work+0x51/0xd0 [kvm]
 [] process_one_work+0x111/0x4d0
 [] worker_thread+0x152/0x340
 [] kthread+0x7e/0x90
 [] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.

Signed-off-by: Jan Kiszka 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman