linux-toradex.git - Linux kernel for Apalis and Colibri modules

Age	Commit message (Collapse)	Author
2025-12-02	Merge tag 'aes-gcm-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull AES-GCM optimizations from Eric Biggers: "More optimizations and cleanups for the x86_64 AES-GCM code: - Add a VAES+AVX2 optimized implementation of AES-GCM. This is very helpful on CPUs that have VAES but not AVX512, such as AMD Zen 3. - Make the VAES+AVX512 optimized implementation of AES-GCM handle large amounts of associated data efficiently. - Remove the "avx10_256" implementation of AES-GCM. It's superseded by the VAES+AVX2 optimized implementation. - Rename the "avx10_512" implementation to "avx512" Overall, this fills in a gap where AES-GCM wasn't fully optimized on some recent CPUs. It also drops code that won't be as useful as initially expected due to AVX10/256 being dropped from the AVX10 spec" * tag 'aes-gcm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: crypto: x86/aes-gcm-vaes-avx2 - initialize full %rax return register crypto: x86/aes-gcm - optimize long AAD processing with AVX512 crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1 crypto: x86/aes-gcm - revise some comments in AVX512 code crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512 crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code crypto: x86/aes-gcm - add VAES+AVX2 optimized code
2025-11-11	lib/crypto: x86/polyval: Migrate optimized code into library	Eric Biggers
	Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it up to the POLYVAL library interface. This makes the POLYVAL library be properly optimized on x86_64. This drops the x86_64 optimizations of polyval in the crypto_shash API. That's fine, since polyval will be removed from crypto_shash entirely since it is unneeded there. But even if it comes back, the crypto_shash API could just be implemented on top of the library API, as usual. Adjust the names and prototypes of the assembly functions to align more closely with the rest of the library code. Also replace a movaps instruction with movups to remove the assumption that the key struct is 16-byte aligned. Users can still align the key if they want (and at least in this case, movups is just as fast as movaps), but it's inconvenient to require it. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-10-26	crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512	Eric Biggers
	With the "avx10_256" code removed and the AVX10 specification having been changed to basically just be a re-packaged AVX512, the "avx10_512" name no longer makes sense. Replace it with "avx512". While doing this, also add the "vaes_" prefix in places that didn't already have it. The result is that the two VAES optimized implementations are consistently called vaes_avx2 and vaes_avx512. (Also drop the "-x86_64" part of the assembly filename, to keep it from getting too long. There's no 32-bit version of this code, and the fact that it's 64-bit is unremarkable; it's the norm for new code.) Note: although aes_gcm_aad_update_vaes_avx512() (previously called aes_gcm_aad_update_vaes_avx10()) uses at most 256-bit vectors, it still depends on the AVX512 CPU feature. So its new name is still accurate. Also, a later commit will make it sometimes use 512-bit vectors anyway. Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251002023117.37504-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-10-26	crypto: x86/aes-gcm - add VAES+AVX2 optimized code	Eric Biggers
	Add an implementation of AES-GCM that uses 256-bit vectors and the following CPU features: Vector AES (VAES), Vector Carryless Multiplication (VPCLMULQDQ), and AVX2. It doesn't require AVX512. So unlike the existing VAES+AVX512 code, it works on CPUs that support VAES but not AVX512, specifically: - AMD Zen 3, both client and server - Intel Alder Lake, Raptor Lake, Meteor Lake, Arrow Lake, and Lunar Lake. (These are client CPUs.) - Intel Sierra Forest. (This is a server CPU.) On these CPUs, this VAES+AVX2 code is much faster than the existing AES-NI code. The AES-NI code uses only 128-bit vectors. These CPUs are widely deployed, making VAES+AVX2 code worthwhile even though hopefully future x86_64 CPUs will uniformly support AVX512. This implementation will also serve as the fallback 256-bit implementation for older Intel CPUs (Ice Lake and Tiger Lake) that support AVX512 but downclock too eagerly when 512-bit vectors are used. Currently, the VAES+AVX10/256 implementation serves that purpose. A later commit will remove that and just use the VAES+AVX2 one. (Note that AES-XTS and AES-CTR already successfully use this approach.) I originally wrote this AES-GCM implementation for BoringSSL. It's been in BoringSSL for a while now, including in Chromium. This is a port of it to the Linux kernel. The main changes in the Linux version include: - Port from "perlasm" to a standard .S file. - Align all assembly functions with what aesni-intel_glue.c expects, including adding support for lengths not a multiple of 16 bytes. - Rework the en/decryption of the final 1 to 127 bytes. This commit increases AES-256-GCM throughput on AMD Milan (Zen 3) by up to 74%, as shown by the following tables: Table 1: AES-256-GCM encryption throughput change, CPU vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ----------------------+-------+-------+-------+-------+-------+-------+ AMD Milan (Zen 3) \| 67% \| 59% \| 61% \| 39% \| 23% \| 27% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ----------------------+-------+-------+-------+-------+-------+ AMD Milan (Zen 3) \| 14% \| 12% \| 7% \| 7% \| 0% \| Table 2: AES-256-GCM decryption throughput change, CPU vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ----------------------+-------+-------+-------+-------+-------+-------+ AMD Milan (Zen 3) \| 74% \| 65% \| 65% \| 44% \| 23% \| 26% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ----------------------+-------+-------+-------+-------+-------+ AMD Milan (Zen 3) \| 12% \| 11% \| 3% \| 2% \| -3% \| Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20251002023117.37504-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-10-11	Merge tag 'x86_cleanups_for_v6.18_rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - Simplify inline asm flag output operands now that the minimum compiler version supports the =@ccCOND syntax - Remove a bunch of AS_* Kconfig symbols which detect assembler support for various instruction mnemonics now that the minimum assembler version supports them all - The usual cleanups all over the place * tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__ x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h> x86/mtrr: Remove license boilerplate text with bad FSF address x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h> x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h> x86/entry/fred: Push __KERNEL_CS directly x86/kconfig: Remove CONFIG_AS_AVX512 crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ crypto: X86 - Remove CONFIG_AS_VAES crypto: x86 - Remove CONFIG_AS_GFNI x86/kconfig: Drop unused and needless config X86_64_SMP
2025-09-06	lib/crypto: curve25519: Consolidate into single module	Eric Biggers
	Reorganize the Curve25519 library code: - Build a single libcurve25519 module, instead of up to three modules: libcurve25519, libcurve25519-generic, and an arch-specific module. - Move the arch-specific Curve25519 code from arch/$(SRCARCH)/crypto/ to lib/crypto/$(SRCARCH)/. Centralize the build rules into lib/crypto/Makefile and lib/crypto/Kconfig. - Include the arch-specific code directly in lib/crypto/curve25519.c via a header, rather than using a separate .c file. - Eliminate the entanglement with CRYPTO. CRYPTO_LIB_CURVE25519 no longer selects CRYPTO, and the arch-specific Curve25519 code no longer depends on CRYPTO. This brings Curve25519 in line with the latest conventions for lib/crypto/, used by other algorithms. The exception is that I kept the generic code in separate translation units for now. (Some of the function names collide between the x86 and generic Curve25519 code. And the Curve25519 functions are very long anyway, so inlining doesn't matter as much for Curve25519 as it does for some other algorithms.) Link: https://lore.kernel.org/r/20250906213523.84915-11-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-08-21	crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ	Uros Bizjak
	Current minimum required version of binutils is 2.30, which supports VPCLMULQDQ instruction mnemonics. Remove check for assembler support of VPCLMULQDQ instructions and all relevant macros for conditional compilation. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/20250819085855.333380-3-ubizjak@gmail.com
2025-08-21	crypto: X86 - Remove CONFIG_AS_VAES	Uros Bizjak
	Current minimum required version of binutils is 2.30, which supports VAES instruction mnemonics. Remove check for assembler support of VAES instructions and all relevant macros for conditional compilation. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/20250819085855.333380-2-ubizjak@gmail.com
2025-07-14	lib/crypto: x86/sha1: Migrate optimized code into library	Eric Biggers
	Instead of exposing the x86-optimized SHA-1 code via x86-specific crypto_shash algorithms, instead just implement the sha1_blocks() library function. This is much simpler, it makes the SHA-1 library functions be x86-optimized, and it fixes the longstanding issue where the x86-optimized SHA-1 code was disabled by default. SHA-1 still remains available through crypto_shash, but individual architectures no longer need to handle it. To match sha1_blocks(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250712232329.818226-14-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30	lib/crypto: x86/sha512: Migrate optimized SHA-512 code to library	Eric Biggers
	Instead of exposing the x86-optimized SHA-512 code via x86-specific crypto_shash algorithms, instead just implement the sha512_blocks() library function. This is much simpler, it makes the SHA-512 (and SHA-384) library functions be x86-optimized, and it fixes the longstanding issue where the x86-optimized SHA-512 code was disabled by default. SHA-512 still remains available through crypto_shash, but individual architectures no longer need to handle it. To match sha512_blocks(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-05-05	crypto: x86/sha256 - implement library instead of shash	Eric Biggers
	Instead of providing crypto_shash algorithms for the arch-optimized SHA-256 code, instead implement the SHA-256 library. This is much simpler, it makes the SHA-256 library functions be arch-optimized, and it fixes the longstanding issue where the arch-optimized SHA-256 was disabled by default. SHA-256 still remains available through crypto_shash, but individual architectures no longer need to handle it. To match sha256_blocks_arch(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28	crypto: x86 - move library functions to arch/x86/lib/crypto/	Eric Biggers
	Continue disentangling the crypto library functions from the generic crypto infrastructure by moving the x86 BLAKE2s, ChaCha, and Poly1305 library functions into a new directory arch/x86/lib/crypto/ that does not depend on CRYPTO. This mirrors the distinction between crypto/ and lib/crypto/. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-07	crypto: x86 - Remove CONFIG_AS_AVX512 handling	Uros Bizjak
	Current minimum required version of binutils is 2.25, which supports AVX-512 instruction mnemonics. Remove check for assembler support of AVX-512 instructions and all relevant macros for conditional compilation. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-07	crypto: x86 - Remove CONFIG_AS_SHA256_NI	Uros Bizjak
	Current minimum required version of binutils is 2.25, which supports SHA-256 instruction mnemonics. Remove check for assembler support of SHA-256 instructions and all relevant macros for conditional compilation. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-07	crypto: x86 - Remove CONFIG_AS_SHA1_NI	Uros Bizjak
	Current minimum required version of binutils is 2.25, which supports SHA-1 instruction mnemonics. Remove check for assembler support of SHA-1 instructions and all relevant macros for conditional compilation. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-02-22	crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR and add VAES support	Eric Biggers
	Delete aes_ctrby8_avx-x86_64.S and add a new assembly file aes-ctr-avx-x86_64.S which follows a similar approach to aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX, VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just AESNI+AVX. Wire it up to the crypto API accordingly. This greatly improves the performance of AES-CTR and AES-XCTR on VAES-capable CPUs, with the best case being AMD Zen 5 where an over 230% increase in throughput is seen on long messages. Performance on non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR code (aesni_ctr_enc) is also kept as-is for now. There are some slight regressions (less than 10%) on some short message lengths on some CPUs; these are difficult to avoid, given how the previous code was so heavily unrolled by message length, and they are not particularly important. Detailed performance results are given in the tables below. Both CTR and XCTR support is retained. The main loop remains 8-vector-wide, which differs from the 4-vector-wide main loops that are used in the XTS and GCM code. A wider loop is appropriate for CTR and XCTR since they have fewer other instructions (such as vpclmulqdq) to interleave with the AES instructions. Similar to what was the case for AES-GCM, the new assembly code also has a much smaller binary size, as it fixes the excessive unrolling by data length and key length present in the old code. Specifically, the new assembly file compiles to about 9 KB of text vs. 28 KB for the old file. This is despite 4x as many implementations being included. The tables below show the detailed performance results. The tables show percentage improvement in single-threaded throughput for repeated encryption of the given message length; an increase from 6000 MB/s to 12000 MB/s would be listed as 100%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. The tested CPUs were all server processors from Google Compute Engine except for Zen 5 which was a Ryzen 9 9950X desktop processor. Table 1: AES-256-CTR throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ---------------------+-------+-------+-------+-------+-------+-------+ AMD Zen 5 \| 232% \| 203% \| 212% \| 143% \| 71% \| 95% \| Intel Emerald Rapids \| 116% \| 116% \| 117% \| 91% \| 78% \| 79% \| Intel Ice Lake \| 109% \| 103% \| 107% \| 81% \| 54% \| 56% \| AMD Zen 4 \| 109% \| 91% \| 100% \| 70% \| 43% \| 59% \| AMD Zen 3 \| 92% \| 78% \| 87% \| 57% \| 32% \| 43% \| AMD Zen 2 \| 9% \| 8% \| 14% \| 12% \| 8% \| 21% \| Intel Skylake \| 7% \| 7% \| 8% \| 5% \| 3% \| 8% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ---------------------+-------+-------+-------+-------+-------+ AMD Zen 5 \| 57% \| 39% \| -9% \| 7% \| -7% \| Intel Emerald Rapids \| 37% \| 42% \| -0% \| 13% \| -8% \| Intel Ice Lake \| 39% \| 30% \| -1% \| 14% \| -9% \| AMD Zen 4 \| 42% \| 38% \| -0% \| 18% \| -3% \| AMD Zen 3 \| 38% \| 35% \| 6% \| 31% \| 5% \| AMD Zen 2 \| 24% \| 23% \| 5% \| 30% \| 3% \| Intel Skylake \| 9% \| 1% \| -4% \| 10% \| -7% \| Table 2: AES-256-XCTR throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ---------------------+-------+-------+-------+-------+-------+-------+ AMD Zen 5 \| 240% \| 201% \| 216% \| 151% \| 75% \| 108% \| Intel Emerald Rapids \| 100% \| 99% \| 102% \| 91% \| 94% \| 104% \| Intel Ice Lake \| 93% \| 89% \| 92% \| 74% \| 50% \| 64% \| AMD Zen 4 \| 86% \| 75% \| 83% \| 60% \| 41% \| 52% \| AMD Zen 3 \| 73% \| 63% \| 69% \| 45% \| 21% \| 33% \| AMD Zen 2 \| -2% \| -2% \| 2% \| 3% \| -1% \| 11% \| Intel Skylake \| -1% \| -1% \| 1% \| 2% \| -1% \| 9% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ---------------------+-------+-------+-------+-------+-------+ AMD Zen 5 \| 78% \| 56% \| -4% \| 38% \| -2% \| Intel Emerald Rapids \| 61% \| 55% \| 4% \| 32% \| -5% \| Intel Ice Lake \| 57% \| 42% \| 3% \| 44% \| -4% \| AMD Zen 4 \| 35% \| 28% \| -1% \| 17% \| -3% \| AMD Zen 3 \| 26% \| 23% \| -3% \| 11% \| -6% \| AMD Zen 2 \| 13% \| 24% \| -1% \| 14% \| -3% \| Intel Skylake \| 16% \| 8% \| -4% \| 35% \| -3% \| Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-01	x86/crc-t10dif: expose CRC-T10DIF function through lib	Eric Biggers
	Move the x86 CRC-T10DIF assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01	x86/crc32: expose CRC32 functions through lib	Eric Biggers
	Move the x86 CRC32 assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-06-07	crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM	Eric Biggers
	Rewrite the AES-NI implementations of AES-GCM, taking advantage of things I learned while writing the VAES-AVX10 implementations. This is a complete rewrite that reduces the AES-NI GCM source code size by about 70% and the binary code size by about 95%, while not regressing performance and in fact improving it significantly in many cases. The following summarizes the state before this patch: - The aesni-intel module registered algorithms "generic-gcm-aesni" and "rfc4106-gcm-aesni" with the crypto API that actually delegated to one of three underlying implementations according to the CPU capabilities detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2. - The AES-NI + AVX and AES-NI + AVX2 assembly code was in aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and 257 KB of binary. This massive binary size was not really appropriate, and depending on the kconfig it could take up over 1% the size of the entire vmlinux. The main loops did 8 blocks per iteration. The AVX code minimized the use of carryless multiplication whereas the AVX2 code did not. The "AVX2" code did not actually use AVX2; the check for AVX2 was really a check for Intel Haswell or later to detect support for fast carryless multiplication. The long source length was caused by factors such as significant code duplication. - The AES-NI only assembly code was in aesni-intel_asm.S and consisted of 1501 lines of source and 15 KB of binary. The main loops did 4 blocks per iteration and minimized the use of carryless multiplication by using Karatsuba multiplication and a multiplication-less reduction. - The assembly code was contributed in 2010-2013. Maintenance has been sporadic and most design choices haven't been revisited. - The assembly function prototypes and the corresponding glue code were separate from and were not consistent with the new VAES-AVX10 code I recently added. The older code had several issues such as not precomputing the GHASH key powers, which hurt performance. This rewrite achieves the following goals: - Much shorter source and binary sizes. The assembly source shrinks from 4300 lines to 1130 lines, and it produces about 9 KB of binary instead of 272 KB. This is achieved via a better designed AES-GCM implementation that doesn't excessively unroll the code and instead prioritizes the parts that really matter. Sharing the C glue code with the VAES-AVX10 implementations also saves 250 lines of C source. - Improve performance on most (possibly all) CPUs on which this code runs, for most (possibly all) message lengths. Benchmark results are given in Tables 1 and 2 below. - Use the same function prototypes and glue code as the new VAES-AVX10 algorithms. This fixes some issues with the integration of the assembly and results in some significant performance improvements, primarily on short messages. Also, the AVX and non-AVX implementations are now registered as separate algorithms with the crypto API, which makes them both testable by the self-tests. - Keep support for AES-NI without AVX (for Westmere, Silvermont, Goldmont, and Tremont), but unify the source code with AES-NI + AVX. Since 256-bit vectors cannot be used without VAES anyway, this is made feasible by just using the non-VEX coded form of most instructions. - Use a unified approach where the main loop does 8 blocks per iteration and uses Karatsuba multiplication to save one pclmulqdq per block but does not use the multiplication-less reduction. This strikes a good balance across the range of CPUs on which this code runs. - Don't spam the kernel log with an informational message on every boot. The following tables summarize the improvement in AES-GCM throughput on various CPU microarchitectures as a result of this patch: Table 1: AES-256-GCM encryption throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| -------------------+-------+-------+-------+-------+-------+-------+ Intel Broadwell \| 2% \| 8% \| 11% \| 18% \| 31% \| 26% \| Intel Skylake \| 1% \| 4% \| 7% \| 12% \| 26% \| 19% \| Intel Cascade Lake \| 3% \| 8% \| 10% \| 18% \| 33% \| 24% \| AMD Zen 1 \| 6% \| 12% \| 6% \| 15% \| 27% \| 24% \| AMD Zen 2 \| 8% \| 13% \| 13% \| 19% \| 26% \| 28% \| AMD Zen 3 \| 8% \| 14% \| 13% \| 19% \| 26% \| 25% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| -------------------+-------+-------+-------+-------+-------+ Intel Broadwell \| 35% \| 29% \| 45% \| 55% \| 54% \| Intel Skylake \| 25% \| 19% \| 28% \| 33% \| 27% \| Intel Cascade Lake \| 36% \| 28% \| 39% \| 49% \| 54% \| AMD Zen 1 \| 27% \| 22% \| 23% \| 29% \| 26% \| AMD Zen 2 \| 32% \| 24% \| 22% \| 25% \| 31% \| AMD Zen 3 \| 30% \| 24% \| 22% \| 23% \| 26% \| Table 2: AES-256-GCM decryption throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| -------------------+-------+-------+-------+-------+-------+-------+ Intel Broadwell \| 3% \| 8% \| 11% \| 19% \| 32% \| 28% \| Intel Skylake \| 3% \| 4% \| 7% \| 13% \| 28% \| 27% \| Intel Cascade Lake \| 3% \| 9% \| 11% \| 19% \| 33% \| 28% \| AMD Zen 1 \| 15% \| 18% \| 14% \| 20% \| 36% \| 33% \| AMD Zen 2 \| 9% \| 16% \| 13% \| 21% \| 26% \| 27% \| AMD Zen 3 \| 8% \| 15% \| 12% \| 18% \| 23% \| 23% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| -------------------+-------+-------+-------+-------+-------+ Intel Broadwell \| 36% \| 31% \| 40% \| 51% \| 53% \| Intel Skylake \| 28% \| 21% \| 23% \| 30% \| 30% \| Intel Cascade Lake \| 36% \| 29% \| 36% \| 47% \| 53% \| AMD Zen 1 \| 35% \| 31% \| 32% \| 35% \| 36% \| AMD Zen 2 \| 31% \| 30% \| 27% \| 38% \| 30% \| AMD Zen 3 \| 27% \| 23% \| 24% \| 32% \| 26% \| The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be listed as 10%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths. I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or Intel low-power CPUs, as these weren't readily available to me. However, based on the design of the new code and the available information about these other CPU microarchitectures, I wouldn't expect any significant regressions, and there's a good chance performance is improved just as it is above. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07	crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM	Eric Biggers
	Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or AVX10. There are two implementations, sharing most source code: one using 256-bit vectors and one using 512-bit vectors. This patch improves AES-GCM performance by up to 162%; see Tables 1 and 2 below. I wrote the new AES-GCM assembly code from scratch, focusing on correctness, performance, code size (both source and binary), and documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is about 1200 lines including extensive comments, and it generates less than 8 KB of binary code. The main loop does 4 vectors at a time, with the AES and GHASH instructions interleaved. Any remainder is handled using a simple 1 vector at a time loop, with masking. Several VAES + AVX512 implementations of AES-GCM exist from Intel, including one in OpenSSL and one proposed for inclusion in Linux in 2021 (https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/). These aren't really suitable to be used, though, due to the massive amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux) and well as the significantly larger amount of assembly source (4978 lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not support 256-bit vectors, which makes it not usable on future AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have downclocking issues. So I ended up starting from scratch. Usually my much shorter code is actually slightly faster than Intel's AVX512 code, though it depends on message length and on which of Intel's implementations is used; for details, see Tables 3 and 4 below. To facilitate potential integration into other projects, I've dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause, the same as the recently added RISC-V crypto code. The following two tables summarize the performance improvement over the existing AES-GCM code in Linux that uses AES-NI and AVX2: Table 1: AES-256-GCM encryption throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ----------------------+-------+-------+-------+-------+-------+-------+ Intel Ice Lake \| 42% \| 48% \| 60% \| 62% \| 70% \| 69% \| Intel Sapphire Rapids \| 157% \| 145% \| 162% \| 119% \| 96% \| 96% \| Intel Emerald Rapids \| 156% \| 144% \| 161% \| 115% \| 95% \| 100% \| AMD Zen 4 \| 103% \| 89% \| 78% \| 56% \| 54% \| 54% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ----------------------+-------+-------+-------+-------+-------+ Intel Ice Lake \| 66% \| 48% \| 49% \| 70% \| 53% \| Intel Sapphire Rapids \| 80% \| 60% \| 41% \| 62% \| 38% \| Intel Emerald Rapids \| 79% \| 60% \| 41% \| 62% \| 38% \| AMD Zen 4 \| 51% \| 35% \| 27% \| 32% \| 25% \| Table 2: AES-256-GCM decryption throughput improvement, CPU microarchitecture vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ----------------------+-------+-------+-------+-------+-------+-------+ Intel Ice Lake \| 42% \| 48% \| 59% \| 63% \| 67% \| 71% \| Intel Sapphire Rapids \| 159% \| 145% \| 161% \| 125% \| 102% \| 100% \| Intel Emerald Rapids \| 158% \| 144% \| 161% \| 124% \| 100% \| 103% \| AMD Zen 4 \| 110% \| 95% \| 80% \| 59% \| 56% \| 54% \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ----------------------+-------+-------+-------+-------+-------+ Intel Ice Lake \| 67% \| 56% \| 46% \| 70% \| 56% \| Intel Sapphire Rapids \| 79% \| 62% \| 39% \| 61% \| 39% \| Intel Emerald Rapids \| 80% \| 62% \| 40% \| 58% \| 40% \| AMD Zen 4 \| 49% \| 36% \| 30% \| 35% \| 28% \| The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be listed as 50%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths. The following two tables summarize how the performance of my code compares with Intel's AVX512 AES-GCM code, both the version that is in OpenSSL and the version that was proposed for inclusion in Linux. Neither version exists in Linux currently, but these are alternative AES-GCM implementations that could be chosen instead of mine. I collected the following numbers on Emerald Rapids using a userspace benchmark program that calls the assembly functions directly. I've also included a comparison with Cloudflare's AES-GCM implementation from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3. Table 3: VAES-based AES-256-GCM encryption throughput in MB/s, implementation name vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ---------------------+-------+-------+-------+-------+-------+-------+ This implementation \| 14171 \| 12956 \| 12318 \| 9588 \| 7293 \| 6449 \| AVX512_Intel_OpenSSL \| 14022 \| 12467 \| 11863 \| 9107 \| 5891 \| 6472 \| AVX512_Intel_Linux \| 13954 \| 12277 \| 11530 \| 8712 \| 6627 \| 5898 \| AVX512_Cloudflare \| 12564 \| 11050 \| 10905 \| 8152 \| 5345 \| 5202 \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ---------------------+-------+-------+-------+-------+-------+ This implementation \| 4939 \| 3688 \| 1846 \| 1821 \| 738 \| AVX512_Intel_OpenSSL \| 4629 \| 4532 \| 2734 \| 2332 \| 1131 \| AVX512_Intel_Linux \| 4035 \| 2966 \| 1567 \| 1330 \| 639 \| AVX512_Cloudflare \| 3344 \| 2485 \| 1141 \| 1127 \| 456 \| Table 4: VAES-based AES-256-GCM decryption throughput in MB/s, implementation name vs. message length in bytes: \| 16384 \| 4096 \| 4095 \| 1420 \| 512 \| 500 \| ---------------------+-------+-------+-------+-------+-------+-------+ This implementation \| 14276 \| 13311 \| 13007 \| 11086 \| 8268 \| 8086 \| AVX512_Intel_OpenSSL \| 14067 \| 12620 \| 12421 \| 9587 \| 5954 \| 7060 \| AVX512_Intel_Linux \| 14116 \| 12795 \| 11778 \| 9269 \| 7735 \| 6455 \| AVX512_Cloudflare \| 13301 \| 12018 \| 11919 \| 9182 \| 7189 \| 6726 \| \| 300 \| 200 \| 64 \| 63 \| 16 \| ---------------------+-------+-------+-------+-------+-------+ This implementation \| 6454 \| 5020 \| 2635 \| 2602 \| 1079 \| AVX512_Intel_OpenSSL \| 5184 \| 5799 \| 2957 \| 2545 \| 1228 \| AVX512_Intel_Linux \| 4394 \| 4247 \| 2235 \| 1635 \| 922 \| AVX512_Cloudflare \| 4289 \| 3851 \| 1435 \| 1417 \| 574 \| So, usually my code is actually slightly faster than Intel's code, though the OpenSSL implementation has a slight edge on messages shorter than 256 bytes in this microbenchmark. (This also holds true when doing the same tests on AMD Zen 4.) It can be seen that the large code size (up to 94x larger!) of the Intel implementations doesn't seem to bring much benefit, so starting from scratch with much smaller code, as I've done, seems appropriate. The performance of my code on messages shorter than 256 bytes could be improved through a limited amount of unrolling, but it's unclear it would be worth it, given code size considerations (e.g. caches) that don't get measured in microbenchmarks. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05	crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs	Eric Biggers
	Add an assembly file aes-xts-avx-x86_64.S which contains a macro that expands into AES-XTS implementations for x86_64 CPUs that support at least AES-NI and AVX, optionally also taking advantage of VAES, VPCLMULQDQ, and AVX512 or AVX10. This patch doesn't expand the macro at all. Later patches will do so, adding each implementation individually so that the motivation and use case for each individual implementation can be fully presented. The file also provides a function aes_xts_encrypt_iv() which handles the encryption of the IV (tweak), using AES-NI and AVX. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06	crypto: x86/aria - implement aria-avx512	Taehee Yoo
	aria-avx512 implementation uses AVX512 and GFNI. It supports 64way parallel processing. So, byteslicing code is changed to support 64way parallel. And it exports some aria-avx2 functions such as encrypt() and decrypt(). AVX and AVX2 have 16 registers. They should use memory to store/load state because of lack of registers. But AVX512 supports 32 registers. So, it doesn't require store/load in the s-box layer. It means that it can reduce overhead of store/load in the s-box layer. Also code become much simpler. Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100: ARIA-AVX512(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption tcrypt: 1 operation in 1504 cycles (1024 bytes) tcrypt: 1 operation in 4595 cycles (4096 bytes) tcrypt: 1 operation in 1763 cycles (1024 bytes) tcrypt: 1 operation in 5540 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption tcrypt: 1 operation in 1502 cycles (1024 bytes) tcrypt: 1 operation in 4615 cycles (4096 bytes) tcrypt: 1 operation in 1759 cycles (1024 bytes) tcrypt: 1 operation in 5554 cycles (4096 bytes) ARIA-AVX2 with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption tcrypt: 1 operation in 2003 cycles (1024 bytes) tcrypt: 1 operation in 5867 cycles (4096 bytes) tcrypt: 1 operation in 2358 cycles (1024 bytes) tcrypt: 1 operation in 7295 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption tcrypt: 1 operation in 2004 cycles (1024 bytes) tcrypt: 1 operation in 5956 cycles (4096 bytes) tcrypt: 1 operation in 2409 cycles (1024 bytes) tcrypt: 1 operation in 7564 cycles (4096 bytes) Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06	crypto: x86/aria - implement aria-avx2	Taehee Yoo
	aria-avx2 implementation uses AVX2, AES-NI, and GFNI. It supports 32way parallel processing. So, byteslicing code is changed to support 32way parallel. And it exports some aria-avx functions such as encrypt() and decrypt(). There are two main logics, s-box layer and diffusion layer. These codes are the same as aria-avx implementation. But some instruction are exchanged because they don't support 256bit registers. Also, AES-NI doesn't support 256bit register. So, aesenclast and aesdeclast are used twice like below: vextracti128 $1, ymm0, xmm6; vaesenclast xmm7, xmm0, xmm0; vaesenclast xmm7, xmm6, xmm6; vinserti128 $1, xmm6, ymm0, ymm0; Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100: ARIA-AVX2 with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption tcrypt: 1 operation in 2003 cycles (1024 bytes) tcrypt: 1 operation in 5867 cycles (4096 bytes) tcrypt: 1 operation in 2358 cycles (1024 bytes) tcrypt: 1 operation in 7295 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption tcrypt: 1 operation in 2004 cycles (1024 bytes) tcrypt: 1 operation in 5956 cycles (4096 bytes) tcrypt: 1 operation in 2409 cycles (1024 bytes) tcrypt: 1 operation in 7564 cycles (4096 bytes) ARIA-AVX with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption tcrypt: 1 operation in 2761 cycles (1024 bytes) tcrypt: 1 operation in 9390 cycles (4096 bytes) tcrypt: 1 operation in 3401 cycles (1024 bytes) tcrypt: 1 operation in 11876 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption tcrypt: 1 operation in 2735 cycles (1024 bytes) tcrypt: 1 operation in 9424 cycles (4096 bytes) tcrypt: 1 operation in 3369 cycles (1024 bytes) tcrypt: 1 operation in 11954 cycles (4096 bytes) Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-12-02	crypto: x86/curve25519 - disable gcov	Joe Fradley
	curve25519-x86_64.c fails to build when CONFIG_GCOV_KERNEL is enabled. The error is "inline assembly requires more registers than available" thrown from the `fsqr()` function. Therefore, excluding this file from GCOV profiling until this issue is resolved. Thereby allowing CONFIG_GCOV_PROFILE_ALL to be enabled for x86. Signed-off-by: Joe Fradley <joefradley@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-09-24	crypto: aria-avx - add AES-NI/AVX/x86_64/GFNI assembler implementation of ↵	Taehee Yoo
	aria cipher The implementation is based on the 32-bit implementation of the aria. Also, aria-avx process steps are the similar to the camellia-avx. 1. Byteslice(16way) 2. Add-round-key. 3. Sbox 4. Diffusion layer. Except for s-box, all steps are the same as the aria-generic implementation. s-box step is very similar to camellia and sm4 implementation. There are 2 implementations for s-box step. One is to use AES-NI and affine transformation, which is the same as Camellia, sm4, and others. Another is to use GFNI. GFNI implementation is faster than AES-NI implementation. So, it uses GFNI implementation if the running CPU supports GFNI. There are 4 s-boxes in the ARIA and the 2 s-boxes are the same as AES's s-boxes. To calculate the first sbox, it just uses the aesenclast and then inverts shift_row. No more process is needed for this job because the first s-box is the same as the AES encryption s-box. To calculate the second sbox(invert of s1), it just uses the aesdeclast and then inverts shift_row. No more process is needed for this job because the second s-box is the same as the AES decryption s-box. To calculate the third s-box, it uses the aesenclast, then affine transformation, which is combined AES inverse affine and ARIA S2. To calculate the last s-box, it uses the aesdeclast, then affine transformation, which is combined X2 and AES forward affine. The optimized third and last s-box logic and GFNI s-box logic are implemented by Jussi Kivilinna. The aria-generic implementation is based on a 32-bit implementation, not an 8-bit implementation. the aria-avx Diffusion Layer implementation is based on aria-generic implementation because 8-bit implementation is not fit for parallel implementation but 32-bit is enough to fit for this. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-06-10	crypto: blake2s - remove shash module	Jason A. Donenfeld
	BLAKE2s has no currently known use as an shash. Just remove all of this unnecessary plumbing. Removing this shash was something we talked about back when we were making BLAKE2s a built-in, but I simply never got around to doing it. So this completes that project. Importantly, this fixs a bug in which the lib code depends on crypto_simd_disabled_for_test, causing linker errors. Also add more alignment tests to the selftests and compare SIMD and non-SIMD compression functions, to make up for what we lose from testmgr.c. Reported-by: gaochao <gaochao49@huawei.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: stable@vger.kernel.org Fixes: 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-06-10	crypto: x86/polyval - Add PCLMULQDQ accelerated implementation of POLYVAL	Nathan Huckleberry
	Add hardware accelerated version of POLYVAL for x86-64 CPUs with PCLMULQDQ support. This implementation is accelerated using PCLMULQDQ instructions to perform the finite field computations. For added efficiency, 8 blocks of the message are processed simultaneously by precomputing the first 8 powers of the key. Schoolbook multiplication is used instead of Karatsuba multiplication because it was found to be slightly faster on x86-64 machines. Montgomery reduction must be used instead of Barrett reduction due to the difference in modulus between POLYVAL's field and other finite fields. More information on POLYVAL can be found in the HCTR2 paper: "Length-preserving encryption with HCTR2": https://eprint.iacr.org/2021/1441.pdf Signed-off-by: Nathan Huckleberry <nhuck@google.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-01-28	crypto: x86/sm3 - add AVX assembly implementation	Tianjia Zhang
	This patch adds AVX assembly accelerated implementation of SM3 secure hash algorithm. From the benchmark data, compared to pure software implementation sm3-generic, the performance increase is up to 38%. The main algorithm implementation based on SM3 AES/BMI2 accelerated work by libgcrypt at: https://gnupg.org/software/libgcrypt/index.html Benchmark on Intel i5-6200U 2.30GHz, performance data of two implementations, pure software sm3-generic and sm3-avx acceleration. The data comes from the 326 mode and 422 mode of tcrypt. The abscissas are different lengths of per update. The data is tabulated and the unit is Mb/s: update-size \| 16 64 256 1024 2048 4096 8192 ------------+------------------------------------------------------- sm3-generic \| 105.97 129.60 182.12 189.62 188.06 193.66 194.88 sm3-avx \| 119.87 163.05 244.44 260.92 257.60 264.87 265.88 Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-01-07	lib/crypto: blake2s: include as built-in	Jason A. Donenfeld
	In preparation for using blake2s in the RNG, we change the way that it is wired-in to the build system. Instead of using ifdefs to select the right symbol, we use weak symbols. And because ARM doesn't need the generic implementation, we make the generic one default only if an arch library doesn't need it already, and then have arch libraries that do need it opt-in. So that the arch libraries can remain tristate rather than bool, we then split the shash part from the glue code. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: linux-kbuild@vger.kernel.org Cc: linux-crypto@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-08-27	crypto: x86/sm4 - add AES-NI/AVX2/x86_64 implementation	Tianjia Zhang
	Like the implementation of AESNI/AVX, this patch adds an accelerated implementation of AESNI/AVX2. In terms of code implementation, by reusing AESNI/AVX mode-related codes, the amount of code is greatly reduced. From the benchmark data, it can be seen that when the block size is 1024, compared to AVX acceleration, the performance achieved by AVX2 has increased by about 70%, it is also 7.7 times of the pure software implementation of sm4-generic. The main algorithm implementation comes from SM4 AES-NI work by libgcrypt and Markku-Juhani O. Saarinen at: https://github.com/mjosaarinen/sm4ni This optimization supports the four modes of SM4, ECB, CBC, CFB, and CTR. Since CBC and CFB do not support multiple block parallel encryption, the optimization effect is not obvious. Benchmark on Intel i5-6200U 2.30GHz, performance data of three implementation methods, pure software sm4-generic, aesni/avx acceleration, and aesni/avx2 acceleration, the data comes from the 218 mode and 518 mode of tcrypt. The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: block-size \| 16 64 128 256 1024 1420 4096 sm4-generic ECB enc \| 60.94 70.41 72.27 73.02 73.87 73.58 73.59 ECB dec \| 61.87 70.53 72.15 73.09 73.89 73.92 73.86 CBC enc \| 56.71 66.31 68.05 69.84 70.02 70.12 70.24 CBC dec \| 54.54 65.91 68.22 69.51 70.63 70.79 70.82 CFB enc \| 57.21 67.24 69.10 70.25 70.73 70.52 71.42 CFB dec \| 57.22 64.74 66.31 67.24 67.40 67.64 67.58 CTR enc \| 59.47 68.64 69.91 71.02 71.86 71.61 71.95 CTR dec \| 59.94 68.77 69.95 71.00 71.84 71.55 71.95 sm4-aesni-avx ECB enc \| 44.95 177.35 292.06 316.98 339.48 322.27 330.59 ECB dec \| 45.28 178.66 292.31 317.52 339.59 322.52 331.16 CBC enc \| 57.75 67.68 69.72 70.60 71.48 71.63 71.74 CBC dec \| 44.32 176.83 284.32 307.24 328.61 312.61 325.82 CFB enc \| 57.81 67.64 69.63 70.55 71.40 71.35 71.70 CFB dec \| 43.14 167.78 282.03 307.20 328.35 318.24 325.95 CTR enc \| 42.35 163.32 279.11 302.93 320.86 310.56 317.93 CTR dec \| 42.39 162.81 278.49 302.37 321.11 310.33 318.37 sm4-aesni-avx2 ECB enc \| 45.19 177.41 292.42 316.12 339.90 322.53 330.54 ECB dec \| 44.83 178.90 291.45 317.31 339.85 322.55 331.07 CBC enc \| 57.66 67.62 69.73 70.55 71.58 71.66 71.77 CBC dec \| 44.34 176.86 286.10 501.68 559.58 483.87 527.46 CFB enc \| 57.43 67.60 69.61 70.52 71.43 71.28 71.65 CFB dec \| 43.12 167.75 268.09 499.33 558.35 490.36 524.73 CTR enc \| 42.42 163.39 256.17 493.95 552.45 481.58 517.19 CTR dec \| 42.49 163.11 256.36 493.34 552.62 481.49 516.83 Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-07-30	crypto: x86/sm4 - add AES-NI/AVX/x86_64 implementation	Tianjia Zhang
	This patch adds AES-NI/AVX/x86_64 assembler implementation of SM4 block cipher. Through two affine transforms, we can use the AES S-Box to simulate the SM4 S-Box to achieve the effect of instruction acceleration. The main algorithm implementation comes from SM4 AES-NI work by libgcrypt and Markku-Juhani O. Saarinen at: https://github.com/mjosaarinen/sm4ni This optimization supports the four modes of SM4, ECB, CBC, CFB, and CTR. Since CBC and CFB do not support multiple block parallel encryption, the optimization effect is not obvious. Benchmark on Intel Xeon Cascadelake, the data comes from the 218 mode and 518 mode of tcrypt. The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: sm4-generic \| 16 64 128 256 1024 1420 4096 ECB enc \| 40.99 46.50 48.05 48.41 49.20 49.25 49.28 ECB dec \| 41.07 46.99 48.15 48.67 49.20 49.25 49.29 CBC enc \| 37.71 45.28 46.77 47.60 48.32 48.37 48.40 CBC dec \| 36.48 44.82 46.43 47.45 48.23 48.30 48.36 CFB enc \| 37.94 44.84 46.12 46.94 47.57 47.46 47.68 CFB dec \| 37.50 42.84 43.74 44.37 44.85 44.80 44.96 CTR enc \| 39.20 45.63 46.75 47.49 48.09 47.85 48.08 CTR dec \| 39.64 45.70 46.72 47.47 47.98 47.88 48.06 sm4-aesni-avx ECB enc \| 33.75 134.47 221.64 243.43 264.05 251.58 258.13 ECB dec \| 34.02 134.92 223.11 245.14 264.12 251.04 258.33 CBC enc \| 38.85 46.18 47.67 48.34 49.00 48.96 49.14 CBC dec \| 33.54 131.29 223.88 245.27 265.50 252.41 263.78 CFB enc \| 38.70 46.10 47.58 48.29 49.01 48.94 49.19 CFB dec \| 32.79 128.40 223.23 244.87 265.77 253.31 262.79 CTR enc \| 32.58 122.23 220.29 241.16 259.57 248.32 256.69 CTR dec \| 32.81 122.47 218.99 241.54 258.42 248.58 256.61 Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-04-19	x86/crypto: Enable objtool in crypto code	Josh Poimboeuf
	Now that all the stack alignment prologues have been cleaned up in the crypto code, enable objtool. Among other benefits, this will allow ORC unwinding to work. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/fc2a1918c50e33e46ef0e9a5de02743f2f6e3639.1614182415.git.jpoimboe@redhat.com
2021-01-14	crypto: x86 - remove glue helper module	Ard Biesheuvel
	All dependencies on the x86 glue helper module have been replaced by local instantiations of the new ECB/CBC preprocessor helper macros, so the glue helper module can be retired. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-04-09	x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2	Jason A. Donenfeld
	Now that the kernel specifies binutils 2.23 as the minimum version, we can remove ifdefs for AVX2 and ADX throughout. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-09	crypto: x86 - clean up poly1305-x86_64-cryptogams.S by 'make clean'	Masahiro Yamada
	poly1305-x86_64-cryptogams.S is a generated file, so it should be cleaned up by 'make clean'. Assigning it to the variable 'targets' teaches Kbuild that it is a generated file. However, this line is not evaluated when cleaning because scripts/Makefile.clean does not include include/config/auto.conf. Remove the ifneq-conditional, so this file is correctly cleaned up. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ingo Molnar <mingo@kernel.org>
2020-04-09	crypto: x86 - rework configuration based on Kconfig	Jason A. Donenfeld
	Now that assembler capabilities are probed inside of Kconfig, we can set up proper Kconfig-based dependencies. We also take this opportunity to reorder the Makefile, so that items are grouped logically by primitive. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-09	x86: remove always-defined CONFIG_AS_AVX	Masahiro Yamada
	CONFIG_AS_AVX was introduced by commit ea4d26ae24e5 ("raid5: add AVX optimized RAID5 checksumming"). We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). I confirmed the code in $(call as-instr,...) can be assembled by the binutils 2.21 assembler and also by LLVM integrated assembler. Remove CONFIG_AS_AVX, which is always defined. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2020-03-05	crypto: x86/curve25519 - support assemblers with no adx support	Jason A. Donenfeld
	Some older version of GAS do not support the ADX instructions, similarly to how they also don't support AVX and such. This commit adds the same build-time detection mechanisms we use for AVX and others for ADX, and then makes sure that the curve25519 library dispatcher calls the right functions. Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-16	crypto: x86/poly1305 - wire up faster implementations for kernel	Jason A. Donenfeld
	These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F. The AVX-512F implementation is disabled on Skylake, due to throttling, but it is quite fast on >= Cannonlake. On the left is cycle counts on a Core i7 6700HQ using the AVX-2 codepath, comparing this implementation ("new") to the implementation in the current crypto api ("old"). On the right are benchmarks on a Xeon Gold 5120 using the AVX-512 codepath. The new implementation is faster on all benchmarks. AVX-2 AVX-512 --------- ----------- size old new size old new ---- ---- ---- ---- ---- ---- 0 70 68 0 74 70 16 92 90 16 96 92 32 134 104 32 136 106 48 172 120 48 184 124 64 218 136 64 218 138 80 254 158 80 260 160 96 298 174 96 300 176 112 342 192 112 342 194 128 388 212 128 384 212 144 428 228 144 420 226 160 466 246 160 464 248 176 510 264 176 504 264 192 550 282 192 544 282 208 594 302 208 582 300 224 628 316 224 624 318 240 676 334 240 662 338 256 716 354 256 708 358 272 764 374 272 748 372 288 802 352 288 788 358 304 420 366 304 422 370 320 428 360 320 432 364 336 484 378 336 486 380 352 426 384 352 434 390 368 478 400 368 480 408 384 488 394 384 490 398 400 542 408 400 542 412 416 486 416 416 492 426 432 534 430 432 538 436 448 544 422 448 546 432 464 600 438 464 600 448 480 540 448 480 548 456 496 594 464 496 594 476 512 602 456 512 606 470 528 656 476 528 656 480 544 600 480 544 606 498 560 650 494 560 652 512 576 664 490 576 662 508 592 714 508 592 716 522 608 656 514 608 664 538 624 708 532 624 710 552 640 716 524 640 720 516 656 770 536 656 772 526 672 716 548 672 722 544 688 770 562 688 768 556 704 774 552 704 778 556 720 826 568 720 832 568 736 768 574 736 780 584 752 822 592 752 826 600 768 830 584 768 836 560 784 884 602 784 888 572 800 828 610 800 838 588 816 884 628 816 884 604 832 888 618 832 894 598 848 942 632 848 946 612 864 884 644 864 896 628 880 936 660 880 942 644 896 948 652 896 952 608 912 1000 664 912 1004 616 928 942 676 928 954 634 944 994 690 944 1000 646 960 1002 680 960 1008 646 976 1054 694 976 1062 658 992 1002 706 992 1012 674 1008 1052 720 1008 1058 690 This commit wires in the prior implementation from Andy, and makes the following changes to be suitable for kernel land. - Some cosmetic and structural changes, like renaming labels to .Lname, constants, and other Linux conventions, as well as making the code easy for us to maintain moving forward. - CPU feature checking is done in C by the glue code. - We avoid jumping into the middle of functions, to appease objtool, and instead parameterize shared code. - We maintain frame pointers so that stack traces make sense. - We remove the dependency on the perl xlate code, which transforms the output into things that assemblers we don't care about use. Importantly, none of our changes affect the arithmetic or core code, but just involve the differing environment of kernel space. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Samuel Neves <sneves@dei.uc.pt> Co-developed-by: Samuel Neves <sneves@dei.uc.pt> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17	crypto: curve25519 - x86_64 library and KPP implementations	Jason A. Donenfeld
	This implementation is the fastest available x86_64 implementation, and unlike Sandy2x, it doesn't requie use of the floating point registers at all. Instead it makes use of BMI2 and ADX, available on recent microarchitectures. The implementation was written by Armando Faz-Hernández with contributions (upstream) from Samuel Neves and me, in addition to further changes in the kernel implementation from us. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Samuel Neves <sneves@dei.uc.pt> Co-developed-by: Samuel Neves <sneves@dei.uc.pt> [ardb: - move to arch/x86/crypto - wire into lib/crypto framework - implement crypto API KPP hooks ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17	crypto: blake2s - x86_64 SIMD implementation	Jason A. Donenfeld
	These implementations from Samuel Neves support AVX and AVX-512VL. Originally this used AVX-512F, but Skylake thermal throttling made AVX-512VL more attractive and possible to do with negligable difference. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Samuel Neves <sneves@dei.uc.pt> Co-developed-by: Samuel Neves <sneves@dei.uc.pt> [ardb: move to arch/x86/crypto, wire into lib/crypto framework] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-26	crypto: aegis128l/aegis256 - remove x86 and generic implementations	Ard Biesheuvel
	Three variants of AEGIS were proposed for the CAESAR competition, and only one was selected for the final portfolio: AEGIS128. The other variants, AEGIS128L and AEGIS256, are not likely to ever turn up in networking protocols or other places where interoperability between Linux and other systems is a concern, nor are they likely to be subjected to further cryptanalysis. However, uninformed users may think that AEGIS128L (which is faster) is equally fit for use. So let's remove them now, before anyone starts using them and we are forced to support them forever. Note that there are no known flaws in the algorithms or in any of these implementations, but they have simply outlived their usefulness. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-26	crypto: morus - remove generic and x86 implementations	Ard Biesheuvel
	MORUS was not selected as a winner in the CAESAR competition, which is not surprising since it is considered to be cryptographically broken [0]. (Note that this is not an implementation defect, but a flaw in the underlying algorithm). Since it is unlikely to be in use currently, let's remove it before we're stuck with it. [0] https://eprint.iacr.org/2019/172.pdf Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-26	crypto: x86/aes - drop scalar assembler implementations	Ard Biesheuvel
	The AES assembler code for x86 isn't actually faster than code generated by the compiler from aes_generic.c, and considering the disproportionate maintenance burden of assembler code on x86, it is better just to drop it entirely. Modern x86 systems will use AES-NI anyway, and given that the modules being removed have a dependency on aes_generic already, we can remove them without running the risk of regressions. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-12-13	crypto: x86/chacha20 - refactor to allow varying number of rounds	Eric Biggers
	In preparation for adding XChaCha12 support, rename/refactor the x86_64 SIMD implementations of ChaCha20 to support different numbers of rounds. Reviewed-by: Martin Willi <martin@strongswan.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-12-13	crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305	Eric Biggers
	Add a 64-bit AVX2 implementation of NHPoly1305, an ε-almost-∆-universal hash function used in the Adiantum encryption mode. For now, only the NH portion is actually AVX2-accelerated; the Poly1305 part is less performance-critical so is just implemented in C. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-12-13	crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305	Eric Biggers
	Add a 64-bit SSE2 implementation of NHPoly1305, an ε-almost-∆-universal hash function used in the Adiantum encryption mode. For now, only the NH portion is actually SSE2-accelerated; the Poly1305 part is less performance-critical so is just implemented in C. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-29	crypto: x86/chacha20 - Add a 8-block AVX-512VL variant	Martin Willi
	This variant is similar to the AVX2 version, but benefits from the AVX-512 rotate instructions and the additional registers, so it can operate without any data on the stack. It uses ymm registers only to avoid the massive core throttling on Skylake-X platforms. Nontheless does it bring a ~30% speed improvement compared to the AVX2 variant for random encryption lengths. The AVX2 version uses "rep movsb" for partial block XORing via the stack. With AVX-512, the new "vmovdqu8" can do this much more efficiently. The associated "kmov" instructions to work with dynamic masks is not part of the AVX-512VL instruction set, hence we depend on AVX-512BW as well. Given that the major AVX-512VL architectures provide AVX-512BW and this extension does not affect core clocking, this seems to be no problem at least for now. Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-10-05	crypto: x86/aes-ni - remove special handling of AES in PCBC mode	Ard Biesheuvel
	For historical reasons, the AES-NI based implementation of the PCBC chaining mode uses a special FPU chaining mode wrapper template to amortize the FPU start/stop overhead over multiple blocks. When this FPU wrapper was introduced, it supported widely used chaining modes such as XTS and CTR (as well as LRW), but currently, PCBC is the only remaining user. Since there are no known users of pcbc(aes) in the kernel, let's remove this special driver, and rely on the generic pcbc driver to encapsulate the AES-NI core cipher. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04	crypto: x86 - remove SHA multibuffer routines and mcryptd	Ard Biesheuvel
	As it turns out, the AVX2 multibuffer SHA routines are currently broken [0], in a way that would have likely been noticed if this code were in wide use. Since the code is too complicated to be maintained by anyone except the original authors, and since the performance benefits for real-world use cases are debatable to begin with, it is better to drop it entirely for the moment. [0] https://marc.info/?l=linux-crypto-vger&m=153476243825350&w=2 Suggested-by: Eric Biggers <ebiggers@google.com> Cc: Megha Dey <megha.dey@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>