<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/raid, branch v6.16-rc6</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>raid6: Add RISC-V SIMD syndrome and recovery calculations</title>
<updated>2025-06-05T21:03:07+00:00</updated>
<author>
<name>Chunyan Zhang</name>
<email>zhangchunyan@iscas.ac.cn</email>
</author>
<published>2025-03-05T08:37:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6093faaf9593fca92f96f165c95ff4b53353b1f4'/>
<id>6093faaf9593fca92f96f165c95ff4b53353b1f4</id>
<content type='text'>
The assembly is originally based on the ARM NEON and int.uc, but uses
RISC-V vector instructions to implement the RAID6 syndrome and
recovery calculations.

The functions are tested on QEMU running with the option "-icount shift=0":

  raid6: rvvx1    gen()  1008 MB/s
  raid6: rvvx2    gen()  1395 MB/s
  raid6: rvvx4    gen()  1584 MB/s
  raid6: rvvx8    gen()  1694 MB/s
  raid6: int64x8  gen()   113 MB/s
  raid6: int64x4  gen()   116 MB/s
  raid6: int64x2  gen()   272 MB/s
  raid6: int64x1  gen()   229 MB/s
  raid6: using algorithm rvvx8 gen() 1694 MB/s
  raid6: .... xor() 1000 MB/s, rmw enabled
  raid6: using rvv recovery algorithm

[Charlie: - Fixup vector options]

Signed-off-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Signed-off-by: Chunyan Zhang &lt;zhangchunyan@iscas.ac.cn&gt;
Reviewed-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Tested-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Link: https://lore.kernel.org/r/20250305083707.74218-1-zhangchunyan@iscas.ac.cn
Signed-off-by: Alexandre Ghiti &lt;alexghiti@rivosinc.com&gt;
Signed-off-by: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The assembly is originally based on the ARM NEON and int.uc, but uses
RISC-V vector instructions to implement the RAID6 syndrome and
recovery calculations.

The functions are tested on QEMU running with the option "-icount shift=0":

  raid6: rvvx1    gen()  1008 MB/s
  raid6: rvvx2    gen()  1395 MB/s
  raid6: rvvx4    gen()  1584 MB/s
  raid6: rvvx8    gen()  1694 MB/s
  raid6: int64x8  gen()   113 MB/s
  raid6: int64x4  gen()   116 MB/s
  raid6: int64x2  gen()   272 MB/s
  raid6: int64x1  gen()   229 MB/s
  raid6: using algorithm rvvx8 gen() 1694 MB/s
  raid6: .... xor() 1000 MB/s, rmw enabled
  raid6: using rvv recovery algorithm

[Charlie: - Fixup vector options]

Signed-off-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Signed-off-by: Chunyan Zhang &lt;zhangchunyan@iscas.ac.cn&gt;
Reviewed-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Tested-by: Charlie Jenkins &lt;charlie@rivosinc.com&gt;
Link: https://lore.kernel.org/r/20250305083707.74218-1-zhangchunyan@iscas.ac.cn
Signed-off-by: Alexandre Ghiti &lt;alexghiti@rivosinc.com&gt;
Signed-off-by: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lib/raid6: Drop IA64 support</title>
<updated>2023-09-11T08:13:18+00:00</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2023-01-13T17:08:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b089ea3cc30de85ea7e20aa66500feb4082dfbf7'/>
<id>b089ea3cc30de85ea7e20aa66500feb4082dfbf7</id>
<content type='text'>
Drop Itanium support from the RAID6 code, and along with it, the 16x and
32x unrolled versions, which were only used by IA64.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Drop Itanium support from the RAID6 code, and along with it, the 16x and
32x unrolled versions, which were only used by IA64.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>raid6: Add LoongArch SIMD recovery implementation</title>
<updated>2023-09-06T14:53:55+00:00</updated>
<author>
<name>WANG Xuerui</name>
<email>git@xen0n.name</email>
</author>
<published>2023-09-06T14:53:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f2091321044d9fbcadb93dfc1c9cf23e563ea40c'/>
<id>f2091321044d9fbcadb93dfc1c9cf23e563ea40c</id>
<content type='text'>
Similar to the syndrome calculation, the recovery algorithms also work
on 64 bytes at a time to align with the L1 cache line size of current
and future LoongArch cores (that we care about). Which means
unrolled-by-4 LSX and unrolled-by-2 LASX code.

The assembly is originally based on the x86 SSSE3/AVX2 ports, but
register allocation has been redone to take advantage of LSX/LASX's 32
vector registers, and instruction sequence has been optimized to suit
(e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
cannot).

Performance numbers measured by instrumenting the raid6test code, on a
3A5000 system clocked at 2.5GHz:

&gt; lasx  2data: 354.987 MiB/s
&gt; lasx  datap: 350.430 MiB/s
&gt; lsx   2data: 340.026 MiB/s
&gt; lsx   datap: 337.318 MiB/s
&gt; intx1 2data: 164.280 MiB/s
&gt; intx1 datap: 187.966 MiB/s

Because recovery algorithms are chosen solely based on priority and
availability, lasx is marked as priority 2 and lsx priority 1. At least
for the current generation of LoongArch micro-architectures, LASX should
always be faster than LSX whenever supported, and have similar power
consumption characteristics (because the only known LASX-capable uarch,
the LA464, always compute the full 256-bit result for vector ops).

Acked-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: WANG Xuerui &lt;git@xen0n.name&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Similar to the syndrome calculation, the recovery algorithms also work
on 64 bytes at a time to align with the L1 cache line size of current
and future LoongArch cores (that we care about). Which means
unrolled-by-4 LSX and unrolled-by-2 LASX code.

The assembly is originally based on the x86 SSSE3/AVX2 ports, but
register allocation has been redone to take advantage of LSX/LASX's 32
vector registers, and instruction sequence has been optimized to suit
(e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
cannot).

Performance numbers measured by instrumenting the raid6test code, on a
3A5000 system clocked at 2.5GHz:

&gt; lasx  2data: 354.987 MiB/s
&gt; lasx  datap: 350.430 MiB/s
&gt; lsx   2data: 340.026 MiB/s
&gt; lsx   datap: 337.318 MiB/s
&gt; intx1 2data: 164.280 MiB/s
&gt; intx1 datap: 187.966 MiB/s

Because recovery algorithms are chosen solely based on priority and
availability, lasx is marked as priority 2 and lsx priority 1. At least
for the current generation of LoongArch micro-architectures, LASX should
always be faster than LSX whenever supported, and have similar power
consumption characteristics (because the only known LASX-capable uarch,
the LA464, always compute the full 256-bit result for vector ops).

Acked-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: WANG Xuerui &lt;git@xen0n.name&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>raid6: Add LoongArch SIMD syndrome calculation</title>
<updated>2023-09-06T14:53:55+00:00</updated>
<author>
<name>WANG Xuerui</name>
<email>git@xen0n.name</email>
</author>
<published>2023-09-06T14:53:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8f3f06dfd6873135068ccf1a0b386308e8c4da38'/>
<id>8f3f06dfd6873135068ccf1a0b386308e8c4da38</id>
<content type='text'>
The algorithms work on 64 bytes at a time, which is the L1 cache line
size of all current and future LoongArch cores (that we care about), as
confirmed by Huacai. The code is based on the generic int.uc algorithm,
unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
not meaningfully improve the performance according to experiments.

Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:

&gt; raid6: lasx     gen() 12726 MB/s
&gt; raid6: lsx      gen() 10001 MB/s
&gt; raid6: int64x8  gen()  2876 MB/s
&gt; raid6: int64x4  gen()  3867 MB/s
&gt; raid6: int64x2  gen()  2531 MB/s
&gt; raid6: int64x1  gen()  1945 MB/s

Comparison of xor() speeds (from different boots but meaningful anyway):

&gt; lasx:    11226 MB/s
&gt; lsx:     6395 MB/s
&gt; int64x4: 2147 MB/s

Performance as measured by raid6test:

&gt; raid6: lasx     gen() 25109 MB/s
&gt; raid6: lsx      gen() 13233 MB/s
&gt; raid6: int64x8  gen()  4164 MB/s
&gt; raid6: int64x4  gen()  6005 MB/s
&gt; raid6: int64x2  gen()  5781 MB/s
&gt; raid6: int64x1  gen()  4119 MB/s
&gt; raid6: using algorithm lasx gen() 25109 MB/s
&gt; raid6: .... xor() 14439 MB/s, rmw enabled

Acked-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: WANG Xuerui &lt;git@xen0n.name&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The algorithms work on 64 bytes at a time, which is the L1 cache line
size of all current and future LoongArch cores (that we care about), as
confirmed by Huacai. The code is based on the generic int.uc algorithm,
unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
not meaningfully improve the performance according to experiments.

Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:

&gt; raid6: lasx     gen() 12726 MB/s
&gt; raid6: lsx      gen() 10001 MB/s
&gt; raid6: int64x8  gen()  2876 MB/s
&gt; raid6: int64x4  gen()  3867 MB/s
&gt; raid6: int64x2  gen()  2531 MB/s
&gt; raid6: int64x1  gen()  1945 MB/s

Comparison of xor() speeds (from different boots but meaningful anyway):

&gt; lasx:    11226 MB/s
&gt; lsx:     6395 MB/s
&gt; int64x4: 2147 MB/s

Performance as measured by raid6test:

&gt; raid6: lasx     gen() 25109 MB/s
&gt; raid6: lsx      gen() 13233 MB/s
&gt; raid6: int64x8  gen()  4164 MB/s
&gt; raid6: int64x4  gen()  6005 MB/s
&gt; raid6: int64x2  gen()  5781 MB/s
&gt; raid6: int64x1  gen()  4119 MB/s
&gt; raid6: using algorithm lasx gen() 25109 MB/s
&gt; raid6: .... xor() 14439 MB/s, rmw enabled

Acked-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: WANG Xuerui &lt;git@xen0n.name&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE</title>
<updated>2022-11-14T17:35:50+00:00</updated>
<author>
<name>Giulio Benetti</name>
<email>giulio.benetti@benettiengineering.com</email>
</author>
<published>2022-10-19T16:04:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=42271ca389edb0446b9e492858b4c38083b0b9f8'/>
<id>42271ca389edb0446b9e492858b4c38083b0b9f8</id>
<content type='text'>
RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.

Signed-off-by: Giulio Benetti &lt;giulio.benetti@benettiengineering.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.

Signed-off-by: Giulio Benetti &lt;giulio.benetti@benettiengineering.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lib/xor: make xor prototypes more friendly to compiler vectorization</title>
<updated>2022-02-11T09:39:39+00:00</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2022-02-05T15:23:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=297565aa22cfa80ab0f88c3569693aea0b6afb6d'/>
<id>297565aa22cfa80ab0f88c3569693aea0b6afb6d</id>
<content type='text'>
Modern compilers are perfectly capable of extracting parallelism from
the XOR routines, provided that the prototypes reflect the nature of the
input accurately, in particular, the fact that the input vectors are
expected not to overlap. This is not documented explicitly, but is
implied by the interchangeability of the various C routines, some of
which use temporary variables while others don't: this means that these
routines only behave identically for non-overlapping inputs.

So let's decorate these input vectors with the __restrict modifier,
which informs the compiler that there is no overlap. While at it, make
the input-only vectors pointer-to-const as well.

Tested-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Link: https://github.com/ClangBuiltLinux/linux/issues/563
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Modern compilers are perfectly capable of extracting parallelism from
the XOR routines, provided that the prototypes reflect the nature of the
input accurately, in particular, the fact that the input vectors are
expected not to overlap. This is not documented explicitly, but is
implied by the interchangeability of the various C routines, some of
which use temporary variables while others don't: this means that these
routines only behave identically for non-overlapping inputs.

So let's decorate these input vectors with the __restrict modifier,
which informs the compiler that there is no overlap. While at it, make
the input-only vectors pointer-to-const as well.

Tested-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Link: https://github.com/ClangBuiltLinux/linux/issues/563
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lib/raid6: Use strict priority ranking for pq gen() benchmarking</title>
<updated>2022-01-06T16:37:03+00:00</updated>
<author>
<name>Dirk Müller</name>
<email>dmueller@suse.de</email>
</author>
<published>2022-01-05T16:38:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=36dacddbf0bdba86cd00f066b4d724157eeb63f1'/>
<id>36dacddbf0bdba86cd00f066b4d724157eeb63f1</id>
<content type='text'>
On x86_64, currently 3 variants of AVX512, 3 variants of AVX2
and 3 variants of SSE2 are benchmarked on initialization, taking
between 144-153 jiffies. Testing across a hardware pool of
various generations of intel cpus I could not find a single
case where SSE2 won over AVX2 or AVX512. There are cases where
AVX2 wins over AVX512 however.

Change "prefer" into an integer priority field (similar to
how recov selection works) to have more than one ranking level
available, which is backwards compatible with existing behavior.

Give AVX2/512 variants higher priority over SSE2 in order to skip
SSE testing when AVX is available. in a AVX2/x86_64/HZ=250 case this
saves in the order of 200ms of initialization time.

Signed-off-by: Dirk Müller &lt;dmueller@suse.de&gt;
Acked-by: Paul Menzel &lt;pmenzel@molgen.mpg.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On x86_64, currently 3 variants of AVX512, 3 variants of AVX2
and 3 variants of SSE2 are benchmarked on initialization, taking
between 144-153 jiffies. Testing across a hardware pool of
various generations of intel cpus I could not find a single
case where SSE2 won over AVX2 or AVX512. There are cases where
AVX2 wins over AVX512 however.

Change "prefer" into an integer priority field (similar to
how recov selection works) to have more than one ranking level
available, which is backwards compatible with existing behavior.

Give AVX2/512 variants higher priority over SSE2 in order to skip
SSE testing when AVX is available. in a AVX2/x86_64/HZ=250 case this
saves in the order of 200ms of initialization time.

Signed-off-by: Dirk Müller &lt;dmueller@suse.de&gt;
Acked-by: Paul Menzel &lt;pmenzel@molgen.mpg.de&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: remove the kernel version of md_u.h</title>
<updated>2020-07-16T13:35:21+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-06-07T14:33:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1a6a050620e496abf42749a2e1d3882645cc053f'/>
<id>1a6a050620e496abf42749a2e1d3882645cc053f</id>
<content type='text'>
mdp_major can just move to drivers/md/md.h.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mdp_major can just move to drivers/md/md.h.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: move the early init autodetect code to drivers/md/</title>
<updated>2020-07-16T13:34:47+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-06-07T14:18:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4f5b246b37e024955c0fcca0c7f5952089052d1d'/>
<id>4f5b246b37e024955c0fcca0c7f5952089052d1d</id>
<content type='text'>
Just like the NFS and CIFS root code this better lives with the
driver it is tightly integrated with.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Just like the NFS and CIFS root code this better lives with the
driver it is tightly integrated with.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: cleanup how md_autodetect_dev is called</title>
<updated>2020-03-24T13:57:08+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-03-24T07:25:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=74cc979c3c7f8328b24651daf15280f07533e735'/>
<id>74cc979c3c7f8328b24651daf15280f07533e735</id>
<content type='text'>
Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code.  Then use IS_BUILTIN to call it instead of the
ifdef magic.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code.  Then use IS_BUILTIN to call it instead of the
ifdef magic.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
