linux-toradex.git/arch/powerpc/lib/Makefile, branch v3.17

powerpc: memcpy optimization for 64bit LE

2014-04-30T05:26:18+00:00

Unaligned stores take alignment exceptions on POWER7 running in little-endian.
This is a dumb little-endian base memcpy that prevents unaligned stores.
Once booted the feature fixup code switches over to the VMX copy loops
(which are already endian safe).

The question is what we do before that switch over. The base 64bit
memcpy takes alignment exceptions on POWER7 so we can't use it as is.
Fixing the causes of alignment exception would slow it down, because
we'd need to ensure all loads and stores are aligned either through
rotate tricks or bytewise loads and stores. Either would be bad for
all other 64bit platforms.

[ I simplified the loop a bit - Anton ]

Signed-off-by: Philippe Bergheaud 
Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: Add VMX optimised xor for RAID5

2013-10-30T05:02:28+00:00

Add a VMX optimised xor, used primarily for RAID5. On a POWER7 blade
this is a decent win:

   32regs    : 17932.800 MB/sec
   altivec   : 19724.800 MB/sec

The bigger gain is when the same test is run in SMT4 mode, as it
would if there was a lot of work going on:

   8regs     :  8377.600 MB/sec
   altivec   : 15801.600 MB/sec

I tested this against an array created without the patch, and also
verified it worked as expected on a little endian kernel.

[ Fix !CONFIG_ALTIVEC build -- BenH ]

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: Use generic memcpy code in little endian

2013-10-11T05:48:40+00:00

We need to fix some endian issues in our memcpy code. For now
just enable the generic memcpy routine for little endian builds.

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: Use generic checksum code in little endian

2013-10-11T05:48:39+00:00

We need to fix some endian issues in our checksum code. For now
just enable the generic checksum routines for little endian builds.

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

uprobes/powerpc: Add dependency on single step emulation

2013-01-29T00:35:06+00:00

Uprobes uses emulate_step in sstep.c, but we haven't explicitly specified
the dependency. On pseries HAVE_HW_BREAKPOINT protects us, but 44x has no
such luxury.

Consolidate other users that depend on sstep and create a new config option.

Signed-off-by: Ananth N Mavinakayanahalli 
Signed-off-by: Suzuki K. Poulose 
Cc: linuxppc-dev@ozlabs.org
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt

powerpc: Build kernel with -mcmodel=medium

2013-01-10T06:00:31+00:00

Finally remove the two level TOC and build with -mcmodel=medium.

Unfortunately we can't build modules with -mcmodel=medium due to
the tricks the kernel module loader plays with percpu data:

# -mcmodel=medium breaks modules because it uses 32bit offsets from
# the TOC pointer to create pointers where possible. Pointers into the
# percpu data area are created by this method.
#
# The kernel module loader relocates the percpu data section from the
# original location (starting with 0xd...) to somewhere in the base
# kernel percpu data space (starting with 0xc...). We need a full
# 64bit relocation for this to work, hence -mcmodel=large.

On older kernels we fall back to the two level TOC (-mminimal-toc)

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-07-03T04:14:46+00:00

Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
instructions.

This is a copy of the POWER7 optimised copy_to_user/copy_from_user
loop. Detailed implementation and performance details can be found in
commit a66086b8197d (powerpc: POWER7 optimised
copy_to_user/copy_from_user using VMX).

I noticed memcpy issues when profiling a RAID6 workload:

	.memcpy
	.async_memcpy
	.async_copy_data
	.__raid_run_ops
	.handle_stripe
	.raid5d
	.md_thread

I created a simplified testcase by building a RAID6 array with 4 1GB
ramdisks (booting with brd.rd_size=1048576):

# mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

I then timed how long it took to write to the entire array:

# dd if=/dev/zero of=/dev/md0 bs=1M

Before: 892 MB/s
After:  999 MB/s

A 12% improvement.

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: POWER7 optimised copy_page using VMX and enhanced prefetch

2012-07-03T04:14:44+00:00

Implement a POWER7 optimised copy_page using VMX and enhanced
prefetch instructions. We use enhanced prefetch hints to prefetch
both the load and store side. We copy a cacheline at a time and
fall back to regular loads and stores if we are unable to use VMX
(eg we are in an interrupt).

The following microbenchmark was used to assess the impact of
the patch:

http://ozlabs.org/~anton/junkcode/page_fault_file.c

We test MAP_PRIVATE page faults across a 1GB file, 100 times:

# time ./page_fault_file -p -l 1G -i 100

Before: 22.25s
After:  18.89s

17% faster

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: Rename copyuser_power7_vmx.c to vmx-helper.c

2012-07-03T04:14:43+00:00

Subsequent patches will add more VMX library functions and it makes
sense to keep all the c-code helper functions in the one file.

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt

powerpc: 64bit optimised __clear_user

2012-07-03T04:14:41+00:00

I noticed __clear_user high up in a profile of one of my RAID stress
tests. The testcase was doing a dd from /dev/zero which ends up
calling __clear_user.

__clear_user is basically a loop with a single 4 byte store which
is horribly slow. We can do much better by aligning the desination
and doing 32 bytes of 8 byte stores in a loop.

The following testcase was used to verify the patch:

http://ozlabs.org/~anton/junkcode/stress_clear_user.c

To show the improvement in performance I ran a dd from /dev/zero
to /dev/null on a POWER7 box:

Before:

# dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s

After:

# time dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s

Over 5x faster.

Signed-off-by: Anton Blanchard 
Signed-off-by: Benjamin Herrenschmidt