linux-toradex.git/arch, branch v3.12.12

x86: mm: change tlb_flushall_shift for IvyBridge

2014-02-20T19:08:01+00:00

commit f98b7a772ab51b52ca4d2a14362fc0e0c8a2e0f3 upstream.

There was a large performance regression that was bisected to
commit 611ae8e3 ("x86/tlb: enable tlb flush range support for
x86").  This patch simply changes the default balance point
between a local and global flush for IvyBridge.

In the interest of allowing the tests to be reproduced, this
patch was tested using mmtests 0.15 with the following
configurations

	configs/config-global-dhp__tlbflush-performance
	configs/config-global-dhp__scheduler-performance
	configs/config-global-dhp__network-performance

Results are from two machines

Ivybridge   4 threads:  Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
Ivybridge   8 threads:  Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Page fault microbenchmark showed nothing interesting.

Ebizzy was configured to run multiple iterations and threads.
Thread counts ranged from 1 to NR_CPUS*2. For each thread count,
it ran 100 iterations and each iteration lasted 10 seconds.

Ivybridge 4 threads
                    3.13.0-rc7            3.13.0-rc7
                       vanilla           altshift-v3
Mean   1     6395.44 (  0.00%)     6789.09 (  6.16%)
Mean   2     7012.85 (  0.00%)     8052.16 ( 14.82%)
Mean   3     6403.04 (  0.00%)     6973.74 (  8.91%)
Mean   4     6135.32 (  0.00%)     6582.33 (  7.29%)
Mean   5     6095.69 (  0.00%)     6526.68 (  7.07%)
Mean   6     6114.33 (  0.00%)     6416.64 (  4.94%)
Mean   7     6085.10 (  0.00%)     6448.51 (  5.97%)
Mean   8     6120.62 (  0.00%)     6462.97 (  5.59%)

Ivybridge 8 threads
                     3.13.0-rc7            3.13.0-rc7
                        vanilla           altshift-v3
Mean   1      7336.65 (  0.00%)     7787.02 (  6.14%)
Mean   2      8218.41 (  0.00%)     9484.13 ( 15.40%)
Mean   3      7973.62 (  0.00%)     8922.01 ( 11.89%)
Mean   4      7798.33 (  0.00%)     8567.03 (  9.86%)
Mean   5      7158.72 (  0.00%)     8214.23 ( 14.74%)
Mean   6      6852.27 (  0.00%)     7952.45 ( 16.06%)
Mean   7      6774.65 (  0.00%)     7536.35 ( 11.24%)
Mean   8      6510.50 (  0.00%)     6894.05 (  5.89%)
Mean   12     6182.90 (  0.00%)     6661.29 (  7.74%)
Mean   16     6100.09 (  0.00%)     6608.69 (  8.34%)

Ebizzy hits the worst case scenario for TLB range flushing every
time and it shows for these Ivybridge CPUs at least that the
default choice is a poor on.  The patch addresses the problem.

Next was a tlbflush microbenchmark written by Alex Shi at
http://marc.info/?l=linux-kernel&m=133727348217113 .  It
measures access costs while the TLB is being flushed.  The
expectation is that if there are always full TLB flushes that
the benchmark would suffer and it benefits from range flushing

There are 320 iterations of the test per thread count.  The
number of entries is randomly selected with a min of 1 and max
of 512.  To ensure a reasonably even spread of entries, the full
range is broken up into 8 sections and a random number selected
within that section.

iteration 1, random number between 0-64
iteration 2, random number between 64-128 etc

This is still a very weak methodology.  When you do not know
what are typical ranges, random is a reasonable choice but it
can be easily argued that the opimisation was for smaller ranges
and an even spread is not representative of any workload that
matters.  To improve this, we'd need to know the probability
distribution of TLB flush range sizes for a set of workloads
that are considered "common", build a synthetic trace and feed
that into this benchmark.  Even that is not perfect because it
would not account for the time between flushes but there are
limits of what can be reasonably done and still be doing
something useful.  If a representative synthetic trace is
provided then this benchmark could be revisited and the shift values retuned.

Ivybridge 4 threads
                        3.13.0-rc7            3.13.0-rc7
                           vanilla           altshift-v3
Mean       1       10.50 (  0.00%)       10.50 (  0.03%)
Mean       2       17.59 (  0.00%)       17.18 (  2.34%)
Mean       3       22.98 (  0.00%)       21.74 (  5.41%)
Mean       5       47.13 (  0.00%)       46.23 (  1.92%)
Mean       8       43.30 (  0.00%)       42.56 (  1.72%)

Ivybridge 8 threads
                         3.13.0-rc7            3.13.0-rc7
                            vanilla           altshift-v3
Mean       1         9.45 (  0.00%)        9.36 (  0.93%)
Mean       2         9.37 (  0.00%)        9.70 ( -3.54%)
Mean       3         9.36 (  0.00%)        9.29 (  0.70%)
Mean       5        14.49 (  0.00%)       15.04 ( -3.75%)
Mean       8        41.08 (  0.00%)       38.73 (  5.71%)
Mean       13       32.04 (  0.00%)       31.24 (  2.49%)
Mean       16       40.05 (  0.00%)       39.04 (  2.51%)

For both CPUs, average access time is reduced which is good as
this is the benchmark that was used to tune the shift values in
the first place albeit it is now known *how* the benchmark was
used.

The scheduler benchmarks were somewhat inconclusive.  They
showed gains and losses and makes me reconsider how stable those
benchmarks really are or if something else might be interfering
with the test results recently.

Network benchmarks were inconclusive.  Almost all results were
flat except for netperf-udp tests on the 4 thread machine.
These results were unstable and showed large variations between
reboots.  It is unknown if this is a recent problems but I've
noticed before that netperf-udp results tend to vary.

Based on these results, changing the default for Ivybridge seems
like a logical choice.

Signed-off-by: Mel Gorman 
Tested-by: Davidlohr Bueso 
Reviewed-by: Alex Shi 
Reviewed-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-cqnadffh1tiqrshthRj3Esge@git.kernel.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

arm64: add DSB after icache flush in __flush_icache_all()

2014-02-20T19:07:59+00:00

commit 5044bad43ee573d0b6d90e3ccb7a40c2c7d25eb4 upstream.

Add DSB after icache flush to complete the cache maintenance operation.
The function __flush_icache_all() is used only for user space mappings
and an ISB is not required because of an exception return before executing
user instructions. An exception return would behave like an ISB.

Signed-off-by: Vinayak Kale 
Acked-by: Will Deacon 
Signed-off-by: Catalin Marinas 
Signed-off-by: Greg Kroah-Hartman

arm64: vdso: fix coarse clock handling

2014-02-20T19:07:59+00:00

commit 069b918623e1510e58dacf178905a72c3baa3ae4 upstream.

When __kernel_clock_gettime is called with a CLOCK_MONOTONIC_COARSE or
CLOCK_REALTIME_COARSE clock id, it returns incorrectly to whatever the
caller has placed in x2 ("ret x2" to return from the fast path).  Fix
this by saving x30/LR to x2 only in code that will call
__do_get_tspec, restoring x30 afterward, and using a plain "ret" to
return from the routine.

Also: while the resulting tv_nsec value for CLOCK_REALTIME and
CLOCK_MONOTONIC must be computed using intermediate values that are
left-shifted by cs_shift (x12, set by __do_get_tspec), the results for
coarse clocks should be calculated using unshifted values
(xtime_coarse_nsec is in units of actual nanoseconds).  The current
code shifts intermediate values by x12 unconditionally, but x12 is
uninitialized when servicing a coarse clock.  Fix this by setting x12
to 0 once we know we are dealing with a coarse clock id.

Signed-off-by: Nathan Lynch 
Acked-by: Will Deacon 
Signed-off-by: Catalin Marinas 
Signed-off-by: Greg Kroah-Hartman

arm64: Invalidate the TLB when replacing pmd entries during boot

2014-02-20T19:07:59+00:00

commit a55f9929a9b257f84b6cc7b2397379cabd744a22 upstream.

With the 64K page size configuration, __create_page_tables in head.S
maps enough memory to get started but using 64K pages rather than 512M
sections with a single pgd/pud/pmd entry pointing to a pte table.
create_mapping() may override the pgd/pud/pmd table entry with a block
(section) one if the RAM size is more than 512MB and aligned correctly.
For the end of this block to be accessible, the old TLB entry must be
invalidated.

Reported-by: Mark Salter 
Tested-by: Mark Salter 
Signed-off-by: Catalin Marinas 
Signed-off-by: Greg Kroah-Hartman

arm64: vdso: prevent ld from aligning PT_LOAD segments to 64k

2014-02-20T19:07:59+00:00

commit 40507403485fcb56b83d6ddfc954e9b08305054c upstream.

Whilst the text segment for our VDSO is marked as PT_LOAD in the ELF
headers, it is mapped by the kernel and not actually subject to
demand-paging. ld doesn't realise this, and emits a p_align field of 64k
(the maximum supported page size), which conflicts with the load address
picked by the kernel on 4k systems, which will be 4k aligned. This
causes GDB to fail with "Failed to read a valid object file image from
memory" when attempting to load the VDSO.

This patch passes the -n option to ld, which prevents it from aligning
PT_LOAD segments to the maximum page size.

Reported-by: Kyle McMartin 
Acked-by: Kyle McMartin 
Signed-off-by: Will Deacon 
Signed-off-by: Catalin Marinas 
Signed-off-by: Greg Kroah-Hartman

arm64: vdso: update wtm fields for CLOCK_MONOTONIC_COARSE

2014-02-20T19:07:59+00:00

commit d4022a335271a48cce49df35d825897914fbffe3 upstream.

Update wall-to-monotonic fields in the VDSO data page
unconditionally.  These are used to service CLOCK_MONOTONIC_COARSE,
which is not guarded by use_syscall.

Signed-off-by: Nathan Lynch 
Acked-by: Will Deacon 
Signed-off-by: Catalin Marinas 
Signed-off-by: Greg Kroah-Hartman

crypto: s390 - fix des and des3_ede ctr concurrency issue

2014-02-20T19:07:58+00:00

commit ee97dc7db4cbda33e4241c2d85b42d1835bc8a35 upstream.

In s390 des and 3des ctr mode there is one preallocated page
used to speed up the en/decryption. This page is not protected
against concurrent usage and thus there is a potential of data
corruption with multiple threads.

The fix introduces locking/unlocking the ctr page and a slower
fallback solution at concurrency situations.

Signed-off-by: Harald Freudenberger 
Signed-off-by: Herbert Xu 
Signed-off-by: Greg Kroah-Hartman

crypto: s390 - fix des and des3_ede cbc concurrency issue

2014-02-20T19:07:58+00:00

commit adc3fcf1552b6e406d172fd9690bbd1395053d13 upstream.

In s390 des and des3_ede cbc mode the iv value is not protected
against concurrency access and modifications from another running
en/decrypt operation which is using the very same tfm struct
instance. This fix copies the iv to the local stack before
the crypto operation and stores the value back when done.

Signed-off-by: Harald Freudenberger 
Signed-off-by: Herbert Xu 
Signed-off-by: Greg Kroah-Hartman

crypto: s390 - fix concurrency issue in aes-ctr mode

2014-02-20T19:07:58+00:00

commit 0519e9ad89e5cd6e6b08398f57c6a71d9580564c upstream.

The aes-ctr mode uses one preallocated page without any concurrency
protection. When multiple threads run aes-ctr encryption or decryption
this can lead to data corruption.

The patch introduces locking for the page and a fallback solution with
slower en/decryption performance in concurrency situations.

Signed-off-by: Harald Freudenberger 
Signed-off-by: Herbert Xu 
Signed-off-by: Greg Kroah-Hartman

xtensa: xtfpga: fix definitions of platform devices

2014-02-13T21:50:14+00:00

commit a558d99263936b8a67d4eff8918745a77bfd8c31 upstream.

Remove __initdata attribute, as the devices may be used after init
sections are freed.

Signed-off-by: Max Filippov 
Signed-off-by: Greg Kroah-Hartman