linux-toradex.git/include/linux/raid, branch v6.16-rc6

raid6: Add RISC-V SIMD syndrome and recovery calculations

2025-06-05T21:03:07+00:00

The assembly is originally based on the ARM NEON code and int.uc, but uses
RISC-V vector instructions to implement the RAID6 syndrome and
recovery calculations.

The functions are tested on QEMU running with the option "-icount shift=0":

  raid6: rvvx1    gen()  1008 MB/s
  raid6: rvvx2    gen()  1395 MB/s
  raid6: rvvx4    gen()  1584 MB/s
  raid6: rvvx8    gen()  1694 MB/s
  raid6: int64x8  gen()   113 MB/s
  raid6: int64x4  gen()   116 MB/s
  raid6: int64x2  gen()   272 MB/s
  raid6: int64x1  gen()   229 MB/s
  raid6: using algorithm rvvx8 gen() 1694 MB/s
  raid6: .... xor() 1000 MB/s, rmw enabled
  raid6: using rvv recovery algorithm

[Charlie: - Fixup vector options]

Signed-off-by: Charlie Jenkins 
Signed-off-by: Chunyan Zhang 
Reviewed-by: Charlie Jenkins 
Tested-by: Charlie Jenkins 
Link: https://lore.kernel.org/r/20250305083707.74218-1-zhangchunyan@iscas.ac.cn
Signed-off-by: Alexandre Ghiti 
Signed-off-by: Palmer Dabbelt

lib/raid6: Drop IA64 support

2023-09-11T08:13:18+00:00

Drop Itanium support from the RAID6 code, and along with it, the 16x and
32x unrolled versions, which were only used by IA64.

Signed-off-by: Ard Biesheuvel

raid6: Add LoongArch SIMD recovery implementation

2023-09-06T14:53:55+00:00

Similar to the syndrome calculation, the recovery algorithms also work
on 64 bytes at a time to align with the L1 cache line size of current
and future LoongArch cores (that we care about), which means
unrolled-by-4 LSX and unrolled-by-2 LASX code.

The assembly is originally based on the x86 SSSE3/AVX2 ports, but
register allocation has been redone to take advantage of LSX/LASX's 32
vector registers, and the instruction sequence has been optimized to suit
(e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
cannot).

Performance numbers measured by instrumenting the raid6test code, on a
3A5000 system clocked at 2.5GHz:

> lasx  2data: 354.987 MiB/s
> lasx  datap: 350.430 MiB/s
> lsx   2data: 340.026 MiB/s
> lsx   datap: 337.318 MiB/s
> intx1 2data: 164.280 MiB/s
> intx1 datap: 187.966 MiB/s

Because recovery algorithms are chosen solely based on priority and
availability, lasx is marked as priority 2 and lsx priority 1. At least
for the current generation of LoongArch micro-architectures, LASX should
always be faster than LSX whenever supported, and have similar power
consumption characteristics (because the only known LASX-capable uarch,
the LA464, always computes the full 256-bit result for vector ops).

Acked-by: Song Liu 
Signed-off-by: WANG Xuerui 
Signed-off-by: Huacai Chen

raid6: Add LoongArch SIMD syndrome calculation

2023-09-06T14:53:55+00:00

The algorithms work on 64 bytes at a time, which is the L1 cache line
size of all current and future LoongArch cores (that we care about), as
confirmed by Huacai. The code is based on the generic int.uc algorithm,
unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
not meaningfully improve the performance according to experiments.
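
For readers unfamiliar with the int.uc approach, the P/Q syndrome
computation can be sketched one byte at a time as below. This is only an
illustration under stated assumptions: the names gf_mul2() and
gen_syndrome_sketch() are hypothetical, and the real code processes wide
words or vector registers with unrolling rather than single bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Multiply one GF(2^8) element by 2, reducing with the RAID6
 * polynomial 0x11d. Hypothetical helper, for illustration only.
 */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * Byte-wise sketch of syndrome generation (not the kernel's code):
 *   P = D0 ^ D1 ^ ... ^ Dn-1
 *   Q = D0 ^ 2*D1 ^ 4*D2 ^ ...   (evaluated with Horner's rule)
 * 'data' holds 'disks' data buffers of 'bytes' bytes each.
 */
static void gen_syndrome_sketch(int disks, size_t bytes,
				const uint8_t *const *data,
				uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < bytes; i++) {
		uint8_t wp = data[disks - 1][i];
		uint8_t wq = wp;

		/* Walk from the highest data disk down to disk 0. */
		for (int z = disks - 2; z >= 0; z--) {
			uint8_t wd = data[z][i];

			wp ^= wd;
			wq = gf_mul2(wq) ^ wd;
		}
		p[i] = wp;
		q[i] = wq;
	}
}
```

The SIMD variants apply the same multiply-by-2 and XOR steps to every
byte lane of a vector register at once, which is why working on a full
cache line per iteration pays off.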

Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:

> raid6: lasx     gen() 12726 MB/s
> raid6: lsx      gen() 10001 MB/s
> raid6: int64x8  gen()  2876 MB/s
> raid6: int64x4  gen()  3867 MB/s
> raid6: int64x2  gen()  2531 MB/s
> raid6: int64x1  gen()  1945 MB/s

Comparison of xor() speeds (from different boots but meaningful anyway):

> lasx:    11226 MB/s
> lsx:     6395 MB/s
> int64x4: 2147 MB/s

Performance as measured by raid6test:

> raid6: lasx     gen() 25109 MB/s
> raid6: lsx      gen() 13233 MB/s
> raid6: int64x8  gen()  4164 MB/s
> raid6: int64x4  gen()  6005 MB/s
> raid6: int64x2  gen()  5781 MB/s
> raid6: int64x1  gen()  4119 MB/s
> raid6: using algorithm lasx gen() 25109 MB/s
> raid6: .... xor() 14439 MB/s, rmw enabled

Acked-by: Song Liu 
Signed-off-by: WANG Xuerui 
Signed-off-by: Huacai Chen

lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE

2022-11-14T17:35:50+00:00

RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.

Signed-off-by: Giulio Benetti 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Song Liu

lib/xor: make xor prototypes more friendly to compiler vectorization

2022-02-11T09:39:39+00:00

Modern compilers are perfectly capable of extracting parallelism from
the XOR routines, provided that the prototypes reflect the nature of the
input accurately, in particular, the fact that the input vectors are
expected not to overlap. This is not documented explicitly, but is
implied by the interchangeability of the various C routines, some of
which use temporary variables while others don't: this means that these
routines only behave identically for non-overlapping inputs.

So let's decorate these input vectors with the __restrict modifier,
which informs the compiler that there is no overlap. While at it, make
the input-only vectors pointer-to-const as well.
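
A minimal sketch of what such a decorated prototype might look like is
shown below. The function name xor_2_sketch() and its word-count
parameter are assumptions made for illustration, not the kernel's actual
routines.

```c
#include <stddef.h>

/*
 * Hypothetical two-source XOR routine. __restrict promises the
 * compiler that p1 and p2 do not overlap, and const marks p2 as
 * input-only, so the loop can be vectorized without runtime
 * overlap checks.
 */
static void xor_2_sketch(size_t words,
			 unsigned long * __restrict p1,
			 const unsigned long * __restrict p2)
{
	for (size_t i = 0; i < words; i++)
		p1[i] ^= p2[i];
}
```

Callers must honor the promise: passing overlapping buffers to a
function declared this way is undefined behavior.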

Tested-by: Nathan Chancellor 
Signed-off-by: Ard Biesheuvel 
Reviewed-by: Nick Desaulniers 
Link: https://github.com/ClangBuiltLinux/linux/issues/563
Signed-off-by: Herbert Xu

lib/raid6: Use strict priority ranking for pq gen() benchmarking

2022-01-06T16:37:03+00:00

On x86_64, currently 3 variants of AVX512, 3 variants of AVX2
and 3 variants of SSE2 are benchmarked on initialization, taking
between 144 and 153 jiffies. Testing across a hardware pool of
various generations of Intel CPUs, I could not find a single
case where SSE2 won over AVX2 or AVX512. There are cases where
AVX2 wins over AVX512, however.

Change "prefer" into an integer priority field (similar to
how recov selection works) to have more than one ranking level
available, which is backwards compatible with existing behavior.

Give AVX2/512 variants higher priority over SSE2 in order to skip
SSE testing when AVX is available. In an AVX2/x86_64/HZ=250 case this
saves on the order of 200ms of initialization time.
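
The selection rule described above can be sketched roughly as follows.
This is a simplified illustration, not the kernel's selection code: the
struct, its fields and choose_algo_sketch() are hypothetical, and the
"available" and "speed" fields stand in for a validity check and a
benchmark run.

```c
#include <stddef.h>

/* Hypothetical description of one gen() implementation. */
struct algo_sketch {
	const char *name;
	int priority;		/* higher value is preferred */
	int available;		/* stand-in for a CPU feature check */
	unsigned long speed;	/* stand-in for a benchmark result */
};

/*
 * Pick the fastest available algorithm among those with the highest
 * priority seen so far. Entries ranked below the current best are
 * skipped, which is where the benchmarking time is saved.
 */
static const struct algo_sketch *
choose_algo_sketch(const struct algo_sketch *algos, size_t n)
{
	const struct algo_sketch *best = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct algo_sketch *a = &algos[i];

		if (!a->available)
			continue;
		/* Lower priority than the current best: do not benchmark. */
		if (best && a->priority < best->priority)
			continue;
		if (!best || a->priority > best->priority ||
		    a->speed > best->speed)
			best = a;
	}
	return best;
}
```

In this sketch the skip only saves work when higher-priority entries
appear earlier in the table, so the ordering of the table matters.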

Signed-off-by: Dirk Müller 
Acked-by: Paul Menzel 
Signed-off-by: Song Liu

md: remove the kernel version of md_u.h

2020-07-16T13:35:21+00:00

mdp_major can just move to drivers/md/md.h.

Signed-off-by: Christoph Hellwig 
Acked-by: Song Liu 
Acked-by: Linus Torvalds

md: move the early init autodetect code to drivers/md/

2020-07-16T13:34:47+00:00

Just like the NFS and CIFS root code, this better lives with the
driver it is tightly integrated with.

Signed-off-by: Christoph Hellwig 
Acked-by: Song Liu 
Acked-by: Linus Torvalds

block: cleanup how md_autodetect_dev is called

2020-03-24T13:57:08+00:00

Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code.  Then use IS_BUILTIN to call it instead of the
ifdef magic.
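
The idea of replacing an #ifdef with a compile-time constant condition
can be sketched in userspace as below. Everything here is a stand-in:
IS_BUILTIN_SKETCH() only mimics the preprocessor trick behind the
kernel's IS_BUILTIN(), and the CONFIG_*_SKETCH symbol, the
md_autodetect_dev_sketch() stub and scan_partition_sketch() are
hypothetical names for illustration.

```c
/* Stand-in for IS_BUILTIN(): 1 if 'option' is defined as 1, else 0. */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND_ARG(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_DEFINED_1(x) IS_DEFINED_2(x)
#define IS_BUILTIN_SKETCH(option) IS_DEFINED_1(option)

/* Pretend the md driver is built in (=y). Hypothetical symbol. */
#define CONFIG_BLK_DEV_MD_SKETCH 1

static int autodetect_calls;

/* Stub standing in for the prototype a shared header would declare. */
static void md_autodetect_dev_sketch(unsigned int dev)
{
	(void)dev;
	autodetect_calls++;
}

/*
 * Caller side: an ordinary C condition instead of an #ifdef block.
 * When the option is not built in, the condition is constant 0 and
 * the compiler can drop the call entirely.
 */
static void scan_partition_sketch(unsigned int dev, int is_raid)
{
	if (IS_BUILTIN_SKETCH(CONFIG_BLK_DEV_MD_SKETCH) && is_raid)
		md_autodetect_dev_sketch(dev);
}
```

Because the call stays visible to the compiler in every configuration,
it is still type-checked even when the option is disabled.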

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe