Merge tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add sig driver API
- Remove signing/verification from akcipher API
- Move crypto_simd_disabled_for_test to lib/crypto
- Add WARN_ON for return values from drivers that indicate memory
corruption
Algorithms:
- Provide crc32-arch and crc32c-arch through Crypto API
- Optimise crc32c code size on x86
- Optimise crct10dif on arm/arm64
- Optimise p10-aes-gcm on powerpc
- Optimise aegis128 on x86
- Output full sample from test interface in jitter RNG
- Retry without padata when it fails in pcrypt
Drivers:
- Add support for Airoha EN7581 TRNG
- Add support for STM32MP25x platforms in stm32
- Enable iproc-r200 RNG driver on BCMBCA
- Add Broadcom BCM74110 RNG driver"
* tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits)
crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx
crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
crypto: aesni - Move back to module_init
crypto: lib/mpi - Export mpi_set_bit
crypto: aes-gcm-p10 - Use the correct bit to test for P10
hwrng: amd - remove reference to removed PPC_MAPLE config
crypto: arm/crct10dif - Implement plain NEON variant
crypto: arm/crct10dif - Macroify PMULL asm code
crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
crypto: arm64/crct10dif - Remove obsolete chunking logic
crypto: bcm - add error check in the ahash_hmac_init function
crypto: caam - add error check to caam_rsa_set_priv_key_form
hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver
dt-bindings: rng: add binding for BCM74110 RNG
padata: Clean up in padata_do_multithreaded()
crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init()
crypto: qat - Fix missing destroy_workqueue in adf_init_aer()
crypto: rsassa-pkcs1 - Reinstate support for legacy protocols
...
This patch reverts commit 0fbafd06bd
("crypto: aesni - fix failing setkey for rfc4106-gcm-aesni") by
moving the aesni init function back to module_init from late_initcall.
The original patch was needed because tests were synchronous. This
is no longer the case so there is no need to postpone the registration.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Remove returns that are immediately followed by another return.
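For illustration only (this is not a hunk from the actual diff), the
pattern being removed is of this shape:

        if (cond) {
                do_one_thing();
                return;         /* redundant: execution reaches the return below anyway */
        }
        return;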
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop using FRAME_BEGIN and FRAME_END in the AEGIS assembly functions,
since all these functions are now leaf functions. This eliminates some
unnecessary instructions.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Update a caller of aegis128_aesni_ad() to round down the length to a
block boundary. After that, aegis128_aesni_ad(), aegis128_aesni_enc(),
and aegis128_aesni_dec() are only passed whole blocks. Update the
assembly code to take advantage of that, which eliminates some unneeded
instructions. For aegis128_aesni_enc() and aegis128_aesni_dec(), the
length is also always nonzero, so stop checking for zero length.
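As a rough sketch (the local names, argument order, and block-size macro
are assumptions for illustration, not copied from the patch), the
call-site change amounts to:

        /* pass only whole blocks of associated data to the assembly */
        aegis128_aesni_ad(state, src, round_down(assoclen, AEGIS128_BLOCK_SIZE));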
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Optimize the code that loads and stores partial blocks, taking advantage
of SSE4.1. The code is adapted from that in aes-gcm-aesni-x86_64.S.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Adjust the prototypes of the AEGIS assembly functions:
- Use proper types instead of 'void *', when applicable.
- Move the length parameter to after the buffers it describes rather
than before, to match the usual convention. Also shorten its name to
just len (which is the name used in the assembly code).
- Declare register aliases at the beginning of each function rather than
once per file. This was necessary because len was moved, but also it
allows adding some aliases where raw registers were used before.
- Put assoclen and cryptlen in the correct order when declaring the
finalization function in the .c file.
- Remove the unnecessary "crypto_" prefix.
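As a rough sketch of the resulting style (the state type and the exact
parameter list are illustrative assumptions, not copied from the patch):

        asmlinkage void aegis128_aesni_enc(struct aegis_state *state,
                                           const u8 *src, u8 *dst,
                                           unsigned int len);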
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first
use case being preparing the length block in fewer instructions.
In practice this does not reduce the set of CPUs on which the code can
run, because all Intel and AMD CPUs with AES-NI also have SSE4.1.
Upgrade the existing SSE2 feature check to SSE4.1, though it seems this
check is not strictly necessary; the aesni-intel module has been getting
away with using SSE4.1 despite checking for AES-NI only.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Remove the AEGIS assembly code paths that were "optimized" to operate on
16-byte aligned data using movdqa, and instead just use the code paths
that use movdqu and can handle data with any alignment.
This does not reduce performance. movdqa is basically a historical
artifact; on aligned data, movdqu and movdqa have had the same
performance since Intel Nehalem (2008) and AMD Bulldozer (2011). And
code that requires AES-NI cannot run on CPUs older than those anyway.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of using a struct of function pointers to decide whether to call
the encryption or decryption assembly functions, use a conditional
branch on a bool. Force-inline the functions to avoid actually
generating the branch. This improves performance slightly since
indirect calls are slow. Remove the now-unnecessary CFI stubs.
Note that just force-inlining the existing functions might cause the
compiler to optimize out the indirect branches, but that would not be a
reliable way to do it and the CFI stubs would still be required.
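A minimal sketch of the idea (the wrapper's name and parameter list are
assumptions):

        /* Force-inlined so that 'enc' is a compile-time constant at each
         * call site and neither a real branch nor an indirect call is
         * generated. */
        static __always_inline void
        aegis128_aesni_crypt(struct aegis_state *state, const u8 *src,
                             u8 *dst, unsigned int len, bool enc)
        {
                if (enc)
                        aegis128_aesni_enc(state, src, dst, len);
                else
                        aegis128_aesni_dec(state, src, dst, len);
        }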
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Don't bother providing empty stubs for the init and exit methods in
struct aead_alg, since they are optional anyway.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Fix the AEGIS assembly code to access 'unsigned int' arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero.
Note: there haven't been any reports of this bug actually causing
incorrect behavior. Neither gcc nor clang guarantee zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Cc: stable@vger.kernel.org
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
unrolled and uses a jump table to jump into the correct location. This
optimization is misguided, as it bloats the binary code size and
introduces an indirect call. x86_64 CPUs can predict loops well, so it
is fine to just use a loop instead. Loop bookkeeping instructions can
compete with the crc instructions for the ALUs, but this is easily
mitigated by unrolling the loop by a smaller amount, such as 4 times.
Therefore, re-roll the loop and make related tweaks to the code.
This reduces the binary code size of crc_pclmul() from 4546 bytes to 418
bytes, a 91% reduction. In general it also makes the code faster, with
some large improvements seen when retpoline is enabled.
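As a C-level illustration of the approach (the real change is in
assembly; this sketch only shows "small unroll plus a loop" as opposed
to a fully unrolled jump table):

        #include <stddef.h>
        #include <stdint.h>
        #include <nmmintrin.h>  /* SSE4.2 crc32 intrinsics */

        static __attribute__((target("sse4.2")))
        uint32_t crc32c_qwords(uint32_t crc, const uint64_t *p, size_t n)
        {
                size_t i = 0;

                /* main loop unrolled 4x so loop bookkeeping doesn't
                 * compete with the crc32 instructions for the ALUs */
                for (; i + 4 <= n; i += 4) {
                        crc = (uint32_t)_mm_crc32_u64(crc, p[i + 0]);
                        crc = (uint32_t)_mm_crc32_u64(crc, p[i + 1]);
                        crc = (uint32_t)_mm_crc32_u64(crc, p[i + 2]);
                        crc = (uint32_t)_mm_crc32_u64(crc, p[i + 3]);
                }
                for (; i < n; i++)      /* remainder, one qword at a time */
                        crc = (uint32_t)_mm_crc32_u64(crc, p[i]);
                return crc;
        }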
More detailed performance results are shown below. They are given as
percent improvement in throughput (negative means regressed) for CPU
microarchitecture vs. input length in bytes. E.g. an improvement from
40 GB/s to 50 GB/s would be listed as 25%.
Table 1: Results with retpoline enabled (the default):
| 512 | 833 | 1024 | 2000 | 3173 | 4096 |
---------------------+-------+-------+-------+-------+-------+-------+
Intel Haswell | 35.0% | 20.7% | 17.8% | 9.7% | -0.2% | 4.4% |
Intel Emerald Rapids | 66.8% | 45.2% | 36.3% | 19.3% | 0.0% | 5.4% |
AMD Zen 2 | 29.5% | 17.2% | 13.5% | 8.6% | -0.5% | 2.8% |
Table 2: Results with retpoline disabled:
| 512 | 833 | 1024 | 2000 | 3173 | 4096 |
---------------------+-------+-------+-------+-------+-------+-------+
Intel Haswell | 3.3% | 4.8% | 4.5% | 0.9% | -2.9% | 0.3% |
Intel Emerald Rapids | 7.5% | 6.4% | 5.2% | 2.3% | -0.0% | 0.6% |
AMD Zen 2 | 11.8% | 1.4% | 0.2% | 1.3% | -0.9% | -0.2% |
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Fix crc32c-pcl-intel-asm_64.S to access 32-bit arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero. Also update the type of
the length argument to be unsigned int rather than int, as the assembly
code treats it as unsigned.
Note: there haven't been any reports of this bug actually causing
incorrect behavior. Neither gcc nor clang guarantee zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The assembly code in crc32c-pcl-intel-asm_64.S is invoked only for
lengths >= 512, due to the overhead of saving and restoring FPU state.
Therefore, it is unnecessary for this code to be excessively "optimized"
for lengths < 200. Eliminate the excessive unrolling of this part of
the code and use a more straightforward qword-at-a-time loop.
Note: the part of the code in question is not entirely redundant, as it
is still used to process any remainder mod 24, as well as any remaining
data when fewer than 200 bytes remain after at least one 3072-byte chunk.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
commit e2d60e2f59 ("crypto: x86/cast5 - drop CTR mode implementation")
removed the calls to cast5_ctr_16way but left the avx implementation.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Update the kconfig help and module description to reflect that VAES
instructions are now used in some cases. Also fix XTR => XCTR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an
unexpected/undocumented behavior of the GNU assembler, which might
change in the future
(https://sourceware.org/bugzilla/show_bug.cgi?id=32073).
M (1) (2) // 1 arg !? Future: 2 args
M 1 + 2 // 1 arg !? Future: 3 args
M 1 2 // 2 args
Add parentheses around the single arguments to support future GNU
assembler and LLVM integrated assembler (when the IsOperator hack from
the following link is dropped).
Link: 055006475e
Signed-off-by: Fangrui Song <maskray@google.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
On PREEMPT_RT, kfree() takes sleeping locks and must not be called with
preemption disabled. Therefore, on PREEMPT_RT skcipher_walk_done() must
not be called from within a kernel_fpu_{begin,end}() pair, even when
it's the last call, which is guaranteed not to allocate memory.
Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end
of the function so that it goes after the kernel_fpu_end(). To make
this work cleanly, rework the data processing loop to handle only
non-last data segments.
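A heavily simplified sketch of the resulting shape (this is not the
actual gcm_crypt() body; the control flow and names are assumptions):

        kernel_fpu_begin();
        while (!last_segment) {         /* 'last_segment' is a placeholder */
                /* ... en/decrypt a non-last segment ... */
                kernel_fpu_end();
                err = skcipher_walk_done(&walk, 0);
                kernel_fpu_begin();
        }
        /* ... en/decrypt the final segment ... */
        kernel_fpu_end();
        /* now safe on PREEMPT_RT: nothing that may sleep runs with
         * preemption disabled */
        err = skcipher_walk_done(&walk, 0);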
Fixes: b06affb1cb ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/linux-crypto/20240802102333.itejxOsJ@linutronix.de
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations. This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.
The following summarizes the state before this patch:
- The aesni-intel module registered algorithms "generic-gcm-aesni" and
"rfc4106-gcm-aesni" with the crypto API that actually delegated to one
of three underlying implementations according to the CPU capabilities
detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
257 KB of binary. This massive binary size was not really
appropriate, and depending on the kconfig it could take up over 1% of
the size of the entire vmlinux. The main loops did 8 blocks per
iteration. The AVX code minimized the use of carryless multiplication
whereas the AVX2 code did not. The "AVX2" code did not actually use
AVX2; the check for AVX2 was really a check for Intel Haswell or later
to detect support for fast carryless multiplication. The long source
length was caused by factors such as significant code duplication.
- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
of 1501 lines of source and 15 KB of binary. The main loops did 4
blocks per iteration and minimized the use of carryless multiplication
by using Karatsuba multiplication and a multiplication-less reduction.
- The assembly code was contributed in 2010-2013. Maintenance has been
sporadic and most design choices haven't been revisited.
- The assembly function prototypes and the corresponding glue code were
separate from and were not consistent with the new VAES-AVX10 code I
recently added. The older code had several issues such as not
precomputing the GHASH key powers, which hurt performance.
This rewrite achieves the following goals:
- Much shorter source and binary sizes. The assembly source shrinks
from 4300 lines to 1130 lines, and it produces about 9 KB of binary
instead of 272 KB. This is achieved via a better designed AES-GCM
implementation that doesn't excessively unroll the code and instead
prioritizes the parts that really matter. Sharing the C glue code
with the VAES-AVX10 implementations also saves 250 lines of C source.
- Improve performance on most (possibly all) CPUs on which this code
runs, for most (possibly all) message lengths. Benchmark results are
given in Tables 1 and 2 below.
- Use the same function prototypes and glue code as the new VAES-AVX10
algorithms. This fixes some issues with the integration of the
assembly and results in some significant performance improvements,
primarily on short messages. Also, the AVX and non-AVX
implementations are now registered as separate algorithms with the
crypto API, which makes them both testable by the self-tests.
- Keep support for AES-NI without AVX (for Westmere, Silvermont,
Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
Since 256-bit vectors cannot be used without VAES anyway, this is made
feasible by just using the non-VEX coded form of most instructions.
- Use a unified approach where the main loop does 8 blocks per iteration
and uses Karatsuba multiplication to save one pclmulqdq per block but
does not use the multiplication-less reduction. This strikes a good
balance across the range of CPUs on which this code runs.
- Don't spam the kernel log with an informational message on every boot.
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 2% | 8% | 11% | 18% | 31% | 26% |
Intel Skylake | 1% | 4% | 7% | 12% | 26% | 19% |
Intel Cascade Lake | 3% | 8% | 10% | 18% | 33% | 24% |
AMD Zen 1 | 6% | 12% | 6% | 15% | 27% | 24% |
AMD Zen 2 | 8% | 13% | 13% | 19% | 26% | 28% |
AMD Zen 3 | 8% | 14% | 13% | 19% | 26% | 25% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 35% | 29% | 45% | 55% | 54% |
Intel Skylake | 25% | 19% | 28% | 33% | 27% |
Intel Cascade Lake | 36% | 28% | 39% | 49% | 54% |
AMD Zen 1 | 27% | 22% | 23% | 29% | 26% |
AMD Zen 2 | 32% | 24% | 22% | 25% | 31% |
AMD Zen 3 | 30% | 24% | 22% | 23% | 26% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 3% | 8% | 11% | 19% | 32% | 28% |
Intel Skylake | 3% | 4% | 7% | 13% | 28% | 27% |
Intel Cascade Lake | 3% | 9% | 11% | 19% | 33% | 28% |
AMD Zen 1 | 15% | 18% | 14% | 20% | 36% | 33% |
AMD Zen 2 | 9% | 16% | 13% | 21% | 26% | 27% |
AMD Zen 3 | 8% | 15% | 12% | 18% | 23% | 23% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 36% | 31% | 40% | 51% | 53% |
Intel Skylake | 28% | 21% | 23% | 30% | 30% |
Intel Cascade Lake | 36% | 29% | 36% | 47% | 53% |
AMD Zen 1 | 35% | 31% | 32% | 35% | 36% |
AMD Zen 2 | 31% | 30% | 27% | 38% | 30% |
AMD Zen 3 | 27% | 23% | 24% | 32% | 26% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10. There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors. This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code. The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved. Any remainder is handled
using a simple 1 vector at a time loop, with masking.
Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues. So I ended up starting from scratch. Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.
To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.
The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 60% | 62% | 70% | 69% |
Intel Sapphire Rapids | 157% | 145% | 162% | 119% | 96% | 96% |
Intel Emerald Rapids | 156% | 144% | 161% | 115% | 95% | 100% |
AMD Zen 4 | 103% | 89% | 78% | 56% | 54% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 66% | 48% | 49% | 70% | 53% |
Intel Sapphire Rapids | 80% | 60% | 41% | 62% | 38% |
Intel Emerald Rapids | 79% | 60% | 41% | 62% | 38% |
AMD Zen 4 | 51% | 35% | 27% | 32% | 25% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 59% | 63% | 67% | 71% |
Intel Sapphire Rapids | 159% | 145% | 161% | 125% | 102% | 100% |
Intel Emerald Rapids | 158% | 144% | 161% | 124% | 100% | 103% |
AMD Zen 4 | 110% | 95% | 80% | 59% | 56% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 67% | 56% | 46% | 70% | 56% |
Intel Sapphire Rapids | 79% | 62% | 39% | 61% | 39% |
Intel Emerald Rapids | 80% | 62% | 40% | 58% | 40% |
AMD Zen 4 | 49% | 36% | 30% | 35% | 28% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine. I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.
I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14171 | 12956 | 12318 | 9588 | 7293 | 6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 | 9107 | 5891 | 6472 |
AVX512_Intel_Linux | 13954 | 12277 | 11530 | 8712 | 6627 | 5898 |
AVX512_Cloudflare | 12564 | 11050 | 10905 | 8152 | 5345 | 5202 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 4939 | 3688 | 1846 | 1821 | 738 |
AVX512_Intel_OpenSSL | 4629 | 4532 | 2734 | 2332 | 1131 |
AVX512_Intel_Linux | 4035 | 2966 | 1567 | 1330 | 639 |
AVX512_Cloudflare | 3344 | 2485 | 1141 | 1127 | 456 |
Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14276 | 13311 | 13007 | 11086 | 8268 | 8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 | 9587 | 5954 | 7060 |
AVX512_Intel_Linux | 14116 | 12795 | 11778 | 9269 | 7735 | 6455 |
AVX512_Cloudflare | 13301 | 12018 | 11919 | 9182 | 7189 | 6726 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 6454 | 5020 | 2635 | 2602 | 1079 |
AVX512_Intel_OpenSSL | 5184 | 5799 | 2957 | 2545 | 1228 |
AVX512_Intel_Linux | 4394 | 4247 | 2235 | 1635 | 922 |
AVX512_Cloudflare | 4289 | 3851 | 1435 | 1417 | 574 |
So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark. (This also holds true when doing
the same tests on AMD Zen 4.) It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate. The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
On x86, make allmodconfig && make W=1 C=1 warns:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/crc32-pclmul.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/curve25519-x86_64.o
Add the missing MODULE_DESCRIPTION() macro invocations.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
New CPU #defines encode vendor and family as well as model.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
New CPU #defines encode vendor and family as well as model.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
New CPU #defines encode vendor and family as well as model.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20240520224620.9480-2-tony.luck@intel.com
Remove a redundant expansion of the AES key, and use rodata for zeroes.
Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey()
because it's used for both versions of AES-GCM, not just RFC4106.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused.
Only the incremental AES-GCM functions (aesni_gcm_init(),
aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used.
This saves 17 KB of object code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since the total length processed by the loop in xts_crypt_slowpath() is
a multiple of AES_BLOCK_SIZE, just round the length down to
AES_BLOCK_SIZE even on the last step. This doesn't change behavior, as
the last step will process a multiple of AES_BLOCK_SIZE regardless.
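A sketch of the simplification (the surrounding variable names and the
exact original condition are assumptions):

        /* before: the last step was special-cased */
        nbytes = walk.nbytes;
        if (nbytes < walk.total)
                nbytes = round_down(nbytes, AES_BLOCK_SIZE);

        /* after: rounding down is harmless on the last step too */
        nbytes = round_down(walk.nbytes, AES_BLOCK_SIZE);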
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
x86_64 has the "interesting" property that the instruction size is
generally a bit shorter for instructions that operate on the 32-bit (or
less) part of registers, or registers that are in the original set of 8.
This patch adjusts the AES-XTS code to take advantage of that property
by changing the LEN parameter from size_t to unsigned int (which is all
that's needed and is what the non-AVX implementation uses) and using the
%eax register for KEYLEN.
This decreases the size of aes-xts-avx-x86_64.o by 1.2%.
Note that changing the kmovq to kmovd was going to be needed anyway to
make the AVX10/256 code really work on CPUs that don't support 512-bit
vectors (since the AVX10 spec says that 64-bit opmask instructions will
only be supported on processors that support 512-bit vectors).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- For conditionally subtracting 16 from LEN when decrypting a message
whose length isn't a multiple of 16, use the cmovnz instruction.
- Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN.
- Remove an unnecessary test instruction.
This results in slightly shorter code, both source and binary.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Decrease the amount of code specific to the different AES variants by
"right-aligning" the sequence of round keys, and for AES-128 and AES-192
just skipping irrelevant rounds at the beginning.
This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves
the efficiency of AES-128 and AES-192. The tradeoff is that for AES-256
some additional not-taken conditional jumps are now executed. But these
are predicted well and are cheap on x86.
Note that the ARMv8 CE based AES-XTS implementation uses a similar
strategy to handle the different AES variants.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since aesni_xts_enc() and aesni_xts_dec() are very similar, generate
them from a macro that's passed an argument enc=1 or enc=0. This
reduces the length of aesni-intel_asm.S by 112 lines while still
producing the exact same object file in both 32-bit and 64-bit mode.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When encrypting a message whose length isn't a multiple of 16 bytes,
encrypt the last full block in the main loop. This works because only
decryption uses the last two tweaks in reverse order, not encryption.
This improves the performance of decrypting messages whose length isn't
a multiple of the AES block length, shrinks the size of
aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test
and a not-taken conditional jump) when encrypting a message whose length
*is* a multiple of the AES block length.
While it's not super useful to optimize for ciphertext stealing given
that it's rarely needed in practice, the other two benefits mentioned
above make this optimization worthwhile.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of loading the message words into both MSG and \m0 and then
adding the round constants to MSG, load the message words into \m0 and
the round constants into MSG and then add \m0 to MSG. This shortens the
source code slightly. It changes the instructions slightly, but it
doesn't affect binary code size and doesn't seem to affect performance.
Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- Load the SHA-256 round constants relative to a pointer that points
into the middle of the constants rather than to the beginning. Since
x86 instructions use signed offsets, this decreases the instruction
length required to access some of the later round constants.
- Use punpcklqdq or punpckhqdq instead of longer instructions such as
pshufd, pblendw, and palignr. This doesn't harm performance.
The end result is that sha256_ni_transform shrinks from 839 bytes to 791
bytes, with no loss in performance.
Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
MSGTMP[0-3] are used to hold the message schedule and are not temporary
registers per se. MSGTMP4 is used as a temporary register for several
different purposes and isn't really related to MSGTMP[0-3]. Rename them
to MSG[0-3] and TMP accordingly.
Also add a comment that clarifies what MSG is.
Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
To avoid source code duplication, do the SHA-256 rounds using macros.
This reduces the length of sha256_ni_asm.S by 153 lines while still
producing the exact same object file.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Access the AES round keys using offsets -7*16 through 7*16, instead of
0*16 through 14*16. This allows VEX-encoded instructions to address all
round keys using 1-byte offsets, whereas before some needed 4-byte
offsets. This decreases the code size of aes-xts-avx-x86_64.o by 4.2%.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Make the non-AVX implementation of AES-XTS (xts-aes-aesni) use the new
glue code that was introduced for the AVX implementations of AES-XTS.
This reduces code size, and it improves the performance of xts-aes-aesni
due to the optimization for messages that don't span page boundaries.
This required moving the new glue functions higher up in the file and
allowing the IV encryption function to be specified by the caller.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since sha512_transform_rorx() uses ymm registers, execute vzeroupper
before returning from it. This is necessary to avoid reducing the
performance of SSE code.
Fixes: e01d69cb01 ("crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since sha256_transform_rorx() uses ymm registers, execute vzeroupper
before returning from it. This is necessary to avoid reducing the
performance of SSE code.
Fixes: d34a460092 ("crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since nh_avx2() uses ymm registers, execute vzeroupper before returning
from it. This is necessary to avoid reducing the performance of SSE
code.
Fixes: 0f961f9f67 ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions. This implementation uses zmm registers to operate on four
AES blocks at a time. The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.
To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models. They will use xts-aes-vaes-avx10_256 instead. For
now, this exclusion list is simply coded into aesni-intel_glue.c. It
may make sense to eventually move it into a more central location.
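A hedged sketch of what such an exclusion check can look like (the table
contents and the flag name are illustrative assumptions, not the patch's
actual code):

        static const struct x86_cpu_id zmm_exclusion_list[] = {
                /* entries for the affected Intel models go here */
                {}
        };

        if (x86_match_cpu(zmm_exclusion_list))
                use_avx10_512 = false;  /* default to xts-aes-vaes-avx10_256 instead */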
xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs. E.g., on AMD Zen 4, AES-256-XTS decryption
throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte
inputs. On Intel Sapphire Rapids, AES-256-XTS decryption throughput
increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs.
Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL
extensions. This implementation avoids using zmm registers, instead
using ymm registers to operate on two AES blocks at a time. The
assembly code is instantiated using a macro so that most of the source
code is shared with other implementations.
This is the optimal implementation on CPUs that support VAES and AVX512
but where the zmm registers should not be used due to downclocking
effects, for example Intel's Ice Lake. It should also be the optimal
implementation on future CPUs that support AVX10/256 but not AVX10/512.
The performance is slightly better than that of xts-aes-vaes-avx2, which
uses the same 256-bit vector length, due to factors such as being able
to use ymm16-ymm31 to cache the AES round keys, and being able to use
the vpternlogd instruction to do XORs more efficiently. For example, on
Ice Lake, the throughput of decrypting 4096-byte messages with
AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with
xts-aes-vaes-avx2. While this is a small improvement, it is
straightforward to provide this implementation (xts-aes-vaes-avx10_256)
as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512
anyway, due to the way the _aes_xts_crypt macro is structured.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with
the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10.
This implementation uses ymm registers to operate on two AES blocks at a
time. The assembly code is instantiated using a macro so that most of
the source code is shared with other implementations.
This is the optimal implementation on AMD Zen 3. It should also be the
optimal implementation on Intel Alder Lake, which similarly supports
VAES but not AVX512. Comparing to xts-aes-aesni-avx on Zen 3,
xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput
with 4096-byte messages, or 23% higher with 512-byte messages.
A large improvement is also seen with CPUs that do support AVX512 (e.g.,
98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte
messages), though the following patches add AVX512 optimized
implementations to get a bit more performance on those CPUs.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that
have the AES-NI and AVX extensions but not VAES. It's similar to the
existing xts-aes-aesni in that it uses xmm registers to operate on one AES
block at a time. It differs from xts-aes-aesni in the following ways:
- It uses the VEX-coded (non-destructive) instructions from AVX.
This improves performance slightly.
- It incorporates some additional optimizations such as interleaving the
tweak computation with AES en/decryption, handling single-page
messages more efficiently, and caching the first round key.
- It supports only 64-bit (x86_64).
- It's generated by an assembly macro that will also be used to generate
VAES-based implementations.
The performance improvement over xts-aes-aesni varies from small to
large, depending on the CPU and other factors such as the size of the
messages en/decrypted. For example, the following increases in
AES-256-XTS decryption throughput are seen on the following CPUs:
| 4096-byte messages | 512-byte messages |
----------------------+--------------------+-------------------+
Intel Skylake | 6% | 31% |
Intel Cascade Lake | 4% | 26% |
AMD Zen 1 | 61% | 73% |
AMD Zen 2 | 36% | 59% |
(The above CPUs don't support VAES, so they can't use VAES instead.)
While this isn't as large an improvement as what VAES provides, this
still seems worthwhile. This implementation is fairly easy to provide
based on the assembly macro that's needed for VAES anyway, and it will
be the best implementation on a large number of CPUs (very roughly, the
CPUs launched by Intel and AMD from 2011 to 2018).
This makes the existing xts-aes-aesni *mostly* obsolete. For now, leave
it in place to support 32-bit kernels and also CPUs like Intel Westmere
that support AES-NI but not AVX. (We could potentially remove it anyway
and just rely on the indirect acceleration via ecb-aes-aesni in those
cases, but that change will need to be considered separately.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.
This patch doesn't expand the macro at all. Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.
The file also provides a function aes_xts_encrypt_iv() which handles the
encryption of the IV (tweak), using AES-NI and AVX.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The aesni_set_key() implementation has no error case, yet its prototype
specifies to return an error code.
Modify the function prototype to return void and adjust the related code.
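Illustrative before/after of the prototype (the argument list is an
assumption, not copied from the patch):

        asmlinkage int  aesni_set_key(struct crypto_aes_ctx *ctx,
                                      const u8 *in_key, unsigned int key_len);
        /* becomes */
        asmlinkage void aesni_set_key(struct crypto_aes_ctx *ctx,
                                      const u8 *in_key, unsigned int key_len);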
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
aes_expandkey() already includes an AES key size check. If AES-NI is
unusable, invoke the function without the size check.
Also, use aes_check_keylen() instead of open-coding the length check.
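A hedged sketch of the setkey flow this describes (the wrapper name and
overall structure are assumptions; aes_expandkey(), aes_check_keylen(),
crypto_simd_usable() and the FPU helpers are existing kernel APIs):

        static int aes_set_key_common(struct crypto_aes_ctx *ctx,
                                      const u8 *in_key, unsigned int key_len)
        {
                int err;

                if (!crypto_simd_usable())
                        /* aes_expandkey() validates the key length itself */
                        return aes_expandkey(ctx, in_key, key_len);

                err = aes_check_keylen(key_len);
                if (err)
                        return err;

                kernel_fpu_begin();
                aesni_set_key(ctx, in_key, key_len);
                kernel_fpu_end();
                return 0;
        }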
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>