Commit Graph

849 Commits

Author SHA1 Message Date
Linus Torvalds
02b2f1a7b8 Merge tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add sig driver API
   - Remove signing/verification from akcipher API
   - Move crypto_simd_disabled_for_test to lib/crypto
   - Add WARN_ON for return values from drivers that indicate memory
     corruption

  Algorithms:
   - Provide crc32-arch and crc32c-arch through Crypto API
   - Optimise crc32c code size on x86
   - Optimise crct10dif on arm/arm64
   - Optimise p10-aes-gcm on powerpc
   - Optimise aegis128 on x86
   - Output full sample from test interface in jitter RNG
   - Retry without padata when it fails in pcrypt

  Drivers:
   - Add support for Airoha EN7581 TRNG
   - Add support for STM32MP25x platforms in stm32
   - Enable iproc-r200 RNG driver on BCMBCA
   - Add Broadcom BCM74110 RNG driver"

* tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits)
  crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx
  crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
  crypto: aesni - Move back to module_init
  crypto: lib/mpi - Export mpi_set_bit
  crypto: aes-gcm-p10 - Use the correct bit to test for P10
  hwrng: amd - remove reference to removed PPC_MAPLE config
  crypto: arm/crct10dif - Implement plain NEON variant
  crypto: arm/crct10dif - Macroify PMULL asm code
  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
  crypto: arm64/crct10dif - Remove obsolete chunking logic
  crypto: bcm - add error check in the ahash_hmac_init function
  crypto: caam - add error check to caam_rsa_set_priv_key_form
  hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver
  dt-bindings: rng: add binding for BCM74110 RNG
  padata: Clean up in padata_do_multithreaded()
  crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init()
  crypto: qat - Fix missing destroy_workqueue in adf_init_aer()
  crypto: rsassa-pkcs1 - Reinstate support for legacy protocols
  ...
2024-11-19 10:28:41 -08:00
Herbert Xu
dccd55892b crypto: aesni - Move back to module_init
This patch reverts commit 0fbafd06bd
("crypto: aesni - fix failing setkey for rfc4106-gcm-aesni") by
moving the aesni init function back to module_init from late_initcall.

The original patch was needed because tests were synchronous.  This
is no longer the case, so there is no need to postpone the registration.
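
As a rough illustration (not the actual aesni-intel glue code; the names
are placeholders), the change amounts to registering at module load time
again:

    #include <linux/init.h>
    #include <linux/module.h>

    static int __init aesni_init(void)
    {
            return 0;       /* e.g. register the AES-NI algorithms here */
    }

    static void __exit aesni_exit(void)
    {
            /* unregister the algorithms */
    }

    module_init(aesni_init);        /* was: late_initcall(aesni_init); */
    module_exit(aesni_exit);
    MODULE_LICENSE("GPL");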

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-11-15 19:52:51 +08:00
Eric Biggers
7cc26d4a5f crypto: x86/aegis128 - remove unneeded RETs
Remove returns that are immediately followed by another return.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
a09be0354b crypto: x86/aegis128 - remove unneeded FRAME_BEGIN and FRAME_END
Stop using FRAME_BEGIN and FRAME_END in the AEGIS assembly functions,
since all these functions are now leaf functions.  This eliminates some
unnecessary instructions.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
a0927a03e7 crypto: x86/aegis128 - take advantage of block-aligned len
Update a caller of aegis128_aesni_ad() to round down the length to a
block boundary.  After that, aegis128_aesni_ad(), aegis128_aesni_enc(),
and aegis128_aesni_dec() are only passed whole blocks.  Update the
assembly code to take advantage of that, which eliminates some unneeded
instructions.  For aegis128_aesni_enc() and aegis128_aesni_dec(), the
length is also always nonzero, so stop checking for zero length.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
933e897431 crypto: x86/aegis128 - optimize partial block handling using SSE4.1
Optimize the code that loads and stores partial blocks, taking advantage
of SSE4.1.  The code is adapted from that in aes-gcm-aesni-x86_64.S.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
8da94b300f crypto: x86/aegis128 - improve assembly function prototypes
Adjust the prototypes of the AEGIS assembly functions:

- Use proper types instead of 'void *', when applicable.

- Move the length parameter to after the buffers it describes rather
  than before, to match the usual convention.  Also shorten its name to
  just len (which is the name used in the assembly code).

- Declare register aliases at the beginning of each function rather than
  once per file.  This was necessary because len was moved, but it also
  allows adding some aliases where raw registers were used before.

- Put assoclen and cryptlen in the correct order when declaring the
  finalization function in the .c file.

- Remove the unnecessary "crypto_" prefix.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
af2aff7caf crypto: x86/aegis128 - optimize length block preparation using SSE4.1
Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first
use case being preparing the length block in fewer instructions.

In practice this does not reduce the set of CPUs on which the code can
run, because all Intel and AMD CPUs with AES-NI also have SSE4.1.

Upgrade the existing SSE2 feature check to SSE4.1, though it seems this
check is not strictly necessary; the aesni-intel module has been getting
away with using SSE4.1 despite checking for AES-NI only.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
595bca25a6 crypto: x86/aegis128 - don't bother with special code for aligned data
Remove the AEGIS assembly code paths that were "optimized" to operate on
16-byte aligned data using movdqa, and instead just use the code paths
that use movdqu and can handle data with any alignment.

This does not reduce performance.  movdqa is basically a historical
artifact; on aligned data, movdqu and movdqa have had the same
performance since Intel Nehalem (2008) and AMD Bulldozer (2011).  And
code that requires AES-NI cannot run on CPUs older than those anyway.
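
For illustration only, in C intrinsics rather than the actual assembly:
the unaligned-load form is sufficient for every alignment, so no separate
aligned path is needed.

    #include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128 */

    /* movdqu-equivalent load: correct for any alignment, and as fast as
     * movdqa on aligned data on every CPU that can run this AES-NI code. */
    static inline __m128i load_block(const void *p)
    {
            return _mm_loadu_si128((const __m128i *)p);
    }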

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
b8d2e7bac3 crypto: x86/aegis128 - eliminate some indirect calls
Instead of using a struct of function pointers to decide whether to call
the encryption or decryption assembly functions, use a conditional
branch on a bool.  Force-inline the functions to avoid actually
generating the branch.  This improves performance slightly since
indirect calls are slow.  Remove the now-unnecessary CFI stubs.

Note that just force-inlining the existing functions might cause the
compiler to optimize out the indirect branches, but that would not be a
reliable way to do it and the CFI stubs would still be required.
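
A minimal C sketch of the pattern (illustrative names, not the actual
aegis glue code):

    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-ins for the assembly encryption/decryption routines. */
    static inline __attribute__((always_inline))
    void aegis_enc_blocks(void *state, const void *src, void *dst, size_t len)
    {
            /* ... */
    }

    static inline __attribute__((always_inline))
    void aegis_dec_blocks(void *state, const void *src, void *dst, size_t len)
    {
            /* ... */
    }

    /* Force-inlined dispatcher: callers pass a constant 'enc', the branch
     * is folded away, and no indirect call is ever generated. */
    static inline __attribute__((always_inline))
    void aegis_crypt_blocks(void *state, const void *src, void *dst,
                            size_t len, bool enc)
    {
            if (enc)
                    aegis_enc_blocks(state, src, dst, len);
            else
                    aegis_dec_blocks(state, src, dst, len);
    }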

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
ebb445f5e7 crypto: x86/aegis128 - remove no-op init and exit functions
Don't bother providing empty stubs for the init and exit methods in
struct aead_alg, since they are optional anyway.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
3b2f2d22fb crypto: x86/aegis128 - access 32-bit arguments as 32-bit
Fix the AEGIS assembly code to access 'unsigned int' arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero.

Note: there haven't been any reports of this bug actually causing
incorrect behavior.  Neither gcc nor clang guarantees zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.

Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Cc: stable@vger.kernel.org
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
84dd048cf8 crypto: x86/crc32c - eliminate jump table and excessive unrolling
crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
unrolled and uses a jump table to jump into the correct location.  This
optimization is misguided, as it bloats the binary code size and
introduces an indirect call.  x86_64 CPUs can predict loops well, so it
is fine to just use a loop instead.  Loop bookkeeping instructions can
compete with the crc instructions for the ALUs, but this is easily
mitigated by unrolling the loop by a smaller amount, such as 4 times.

Therefore, re-roll the loop and make related tweaks to the code.
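
A rough C sketch of the idea, using the crc32 instruction via intrinsics
and leaving out the three-way stream splitting and PCLMULQDQ combining
that the real assembly does: a loop unrolled a modest amount plus a short
remainder loop replaces the fully unrolled body and its jump table.

    #include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32c_qwords(uint32_t crc, const uint64_t *p, size_t n)
    {
            uint64_t c = crc;
            size_t i = 0;

            /* Unroll by 4 so loop bookkeeping doesn't compete with the
             * crc32 instructions for the ALUs. */
            for (; i + 4 <= n; i += 4) {
                    c = _mm_crc32_u64(c, p[i]);
                    c = _mm_crc32_u64(c, p[i + 1]);
                    c = _mm_crc32_u64(c, p[i + 2]);
                    c = _mm_crc32_u64(c, p[i + 3]);
            }
            for (; i < n; i++)      /* 0-3 leftover qwords */
                    c = _mm_crc32_u64(c, p[i]);
            return (uint32_t)c;
    }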

This reduces the binary code size of crc_pclmul() from 4546 bytes to 418
bytes, a 91% reduction.  In general it also makes the code faster, with
some large improvements seen when retpoline is enabled.

More detailed performance results are shown below.  They are given as
percent improvement in throughput (negative means regressed) for CPU
microarchitecture vs. input length in bytes.  E.g. an improvement from
40 GB/s to 50 GB/s would be listed as 25%.

Table 1: Results with retpoline enabled (the default):

                       |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  Intel Haswell        | 35.0% | 20.7% | 17.8% |  9.7% | -0.2% |  4.4% |
  Intel Emerald Rapids | 66.8% | 45.2% | 36.3% | 19.3% |  0.0% |  5.4% |
  AMD Zen 2            | 29.5% | 17.2% | 13.5% |  8.6% | -0.5% |  2.8% |

Table 2: Results with retpoline disabled:

                       |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  Intel Haswell        |  3.3% |  4.8% |  4.5% |  0.9% | -2.9% |  0.3% |
  Intel Emerald Rapids |  7.5% |  6.4% |  5.2% |  2.3% | -0.0% |  0.6% |
  AMD Zen 2            | 11.8% |  1.4% |  0.2% |  1.3% | -0.9% | -0.2% |

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
Eric Biggers
eebcadfa21 crypto: x86/crc32c - access 32-bit arguments as 32-bit
Fix crc32c-pcl-intel-asm_64.S to access 32-bit arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero.  Also update the type of
the length argument to be unsigned int rather than int, as the assembly
code treats it as unsigned.

Note: there haven't been any reports of this bug actually causing
incorrect behavior.  Neither gcc nor clang guarantees zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
Eric Biggers
84ebf9dbe6 crypto: x86/crc32c - simplify code for handling fewer than 200 bytes
The assembly code in crc32c-pcl-intel-asm_64.S is invoked only for
lengths >= 512, due to the overhead of saving and restoring FPU state.
Therefore, it is unnecessary for this code to be excessively "optimized"
for lengths < 200.  Eliminate the excessive unrolling of this part of
the code and use a more straightforward qword-at-a-time loop.

Note: the part of the code in question is not entirely redundant, as it
is still used to process any remainder mod 24, as well as any remaining
data when fewer than 200 bytes remain after at least one 3072-byte chunk.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
Dr. David Alan Gilbert
528888f33d crypto: x86/cast5 - Remove unused cast5_ctr_16way
Commit e2d60e2f59 ("crypto: x86/cast5 - drop CTR mode implementation")
removed the calls to cast5_ctr_16way but left the AVX implementation.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-10 17:08:02 +08:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
Every architecture's asm/unaligned.h just includes asm-generic/unaligned.h;
might as well move that header to linux/unaligned.h and include it
directly - there's nothing arch-specific in it.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Eric Biggers
c299d7af9d crypto: x86/aesni - update docs for aesni-intel module
Update the kconfig help and module description to reflect that VAES
instructions are now used in some cases.  Also fix XTR => XCTR.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-09-06 14:50:45 +08:00
Fangrui Song
3363c460ef crypto: x86/sha256 - Add parentheses around macros' single arguments
The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an
unexpected/undocumented behavior of the GNU assembler, which might
change in the future
(https://sourceware.org/bugzilla/show_bug.cgi?id=32073).

    M (1) (2) // 1 arg !? Future: 2 args
    M 1 + 2   // 1 arg !? Future: 3 args

    M 1 2     // 2 args

Add parentheses around the single arguments to support future GNU
assembler and LLVM integrated assembler (when the IsOperator hack from
the following link is dropped).

Link: 055006475e
Signed-off-by: Fangrui Song <maskray@google.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-24 21:36:07 +08:00
Eric Biggers
001412493e crypto: x86/aes-gcm - fix PREEMPT_RT issue in gcm_crypt()
On PREEMPT_RT, kfree() takes sleeping locks and must not be called with
preemption disabled.  Therefore, on PREEMPT_RT skcipher_walk_done() must
not be called from within a kernel_fpu_{begin,end}() pair, even when
it's the last call which is guaranteed to not allocate memory.

Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end
of the function so that it goes after the kernel_fpu_end().  To make
this work cleanly, rework the data processing loop to handle only
non-last data segments.
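
In outline, and only as a hedged sketch (the real gcm_crypt() loop is
considerably more involved), the resulting shape is:

    #include <asm/fpu/api.h>                /* kernel_fpu_begin/end() */
    #include <crypto/internal/skcipher.h>   /* skcipher_walk_done() */

    static int gcm_crypt_sketch(struct skcipher_walk *walk)
    {
            kernel_fpu_begin();
            /* ... SIMD processing of every segment, last one included ... */
            kernel_fpu_end();

            /* The final skcipher_walk_done() may end up in kfree(), which
             * takes sleeping locks on PREEMPT_RT, so issue it only after
             * kernel_fpu_end(). */
            return skcipher_walk_done(walk, 0);
    }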

Fixes: b06affb1cb ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/linux-crypto/20240802102333.itejxOsJ@linutronix.de
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-10 12:25:34 +08:00
Eric Biggers
e6e758fa64 crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM
Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
  of three underlying implementations according to the CPU capabilities
  detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% of the
  size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless multiplication
  whereas the AVX2 code did not.  The "AVX2" code did not actually use
  AVX2; the check for AVX2 was really a check for Intel Haswell or later
  to detect support for fast carryless multiplication.  The long source
  length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless multiplication
  by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is made
  feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration
  and uses Karatsuba multiplication to save one pclmulqdq per block but
  does not use the multiplication-less reduction.  This strikes a good
  balance across the range of CPUs on which this code runs.

- Don't spam the kernel log with an informational message on every boot.

The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:47:58 +08:00
Eric Biggers
b06affb1cb crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10.  There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors.  This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code.  The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved.  Any remainder is handled
using a simple 1 vector at a time loop, with masking.

Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
  as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues.  So I ended up starting from scratch.  Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.

The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine.  I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.

I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark.  (This also holds true when doing
the same tests on AMD Zen 4.)  It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate.  The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:47:58 +08:00
Jeff Johnson
3aeb1da092 crypto: x86 - add missing MODULE_DESCRIPTION() macros
On x86, make allmodconfig && make W=1 C=1 warns:

WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/crc32-pclmul.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/curve25519-x86_64.o

Add the missing MODULE_DESCRIPTION() macro invocations.
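
For example (the description string here is illustrative, not necessarily
the exact one added):

    /* In the crc32-pclmul glue source, next to the other MODULE_*() tags: */
    MODULE_DESCRIPTION("CRC32 hash function, PCLMULQDQ accelerated");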

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:46:39 +08:00
Tony Luck
adc5167be5 crypto: x86/poly1305 - Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31 17:12:30 +08:00
Tony Luck
b2d3d79780 crypto: x86/twofish - Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31 17:12:21 +08:00
Tony Luck
6d85a058cf crypto: x86/aes-xts - switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20240520224620.9480-2-tony.luck@intel.com
2024-05-22 11:10:48 +02:00
Eric Biggers
ed265f7fd9 crypto: x86/aes-gcm - simplify GCM hash subkey derivation
Remove a redundant expansion of the AES key, and use rodata for zeroes.
Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey()
because it's used for both versions of AES-GCM, not just RFC4106.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
a0bbb1c187 crypto: x86/aes-gcm - delete unused GCM assembly code
Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused.
Only the incremental AES-GCM functions (aesni_gcm_init(),
aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used.

This saves 17 KB of object code.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
6a80586474 crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()
Since the total length processed by the loop in xts_crypt_slowpath() is
a multiple of AES_BLOCK_SIZE, just round the length down to
AES_BLOCK_SIZE even on the last step.  This doesn't change behavior, as
the last step will process a multiple of AES_BLOCK_SIZE regardless.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
543ea178fb crypto: x86/aes-xts - optimize size of instructions operating on lengths
x86_64 has the "interesting" property that the instruction size is
generally a bit shorter for instructions that operate on the 32-bit (or
less) part of registers, or registers that are in the original set of 8.

This patch adjusts the AES-XTS code to take advantage of that property
by changing the LEN parameter from size_t to unsigned int (which is all
that's needed and is what the non-AVX implementation uses) and using the
%eax register for KEYLEN.

This decreases the size of aes-xts-avx-x86_64.o by 1.2%.

Note that changing the kmovq to kmovd was going to be needed anyway to
make the AVX10/256 code really work on CPUs that don't support 512-bit
vectors (since the AVX10 spec says that 64-bit opmask instructions will
only be supported on processors that support 512-bit vectors).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
e619723a85 crypto: x86/aes-xts - eliminate a few more instructions
- For conditionally subtracting 16 from LEN when decrypting a message
  whose length isn't a multiple of 16, use the cmovnz instruction.

- Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN.

- Remove an unnecessary test instruction.

This results in slightly shorter code, both source and binary.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
2717e01fc3 crypto: x86/aes-xts - handle AES-128 and AES-192 more efficiently
Decrease the amount of code specific to the different AES variants by
"right-aligning" the sequence of round keys, and for AES-128 and AES-192
just skipping irrelevant rounds at the beginning.
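
A hedged C illustration of the layout idea (names and types invented here;
the real code does this in assembly with fixed displacements):

    #include <stdint.h>

    #define MAX_NR_ROUNDS 14        /* AES-256 */

    struct xts_round_keys {
            /* Round keys stored "right-aligned": the final round key always
             * sits at index MAX_NR_ROUNDS, whatever the key size. */
            uint8_t rk[MAX_NR_ROUNDS + 1][16];
    };

    static inline const uint8_t *first_round_key(const struct xts_round_keys *k,
                                                 int nrounds /* 10, 12 or 14 */)
    {
            /* AES-128 (10 rounds) starts at index 4, AES-192 at index 2,
             * AES-256 at index 0 -- the tail of the schedule is shared. */
            return k->rk[MAX_NR_ROUNDS - nrounds];
    }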

This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves
the efficiency of AES-128 and AES-192.  The tradeoff is that for AES-256
some additional not-taken conditional jumps are now executed.  But these
are predicted well and are cheap on x86.

Note that the ARMv8 CE based AES-XTS implementation uses a similar
strategy to handle the different AES variants.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
ea9459ef36 crypto: x86/aesni-xts - deduplicate aesni_xts_enc() and aesni_xts_dec()
Since aesni_xts_enc() and aesni_xts_dec() are very similar, generate
them from a macro that's passed an argument enc=1 or enc=0.  This
reduces the length of aesni-intel_asm.S by 112 lines while still
producing the exact same object file in both 32-bit and 64-bit mode.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
1d27e1f5c8 crypto: x86/aes-xts - handle CTS encryption more efficiently
When encrypting a message whose length isn't a multiple of 16 bytes,
encrypt the last full block in the main loop.  This works because only
decryption uses the last two tweaks in reverse order, not encryption.

This improves the performance of decrypting messages whose length isn't
a multiple of the AES block length, shrinks the size of
aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test
and a not-taken conditional jump) when encrypting a message whose length
*is* a multiple of the AES block length.

While it's not super useful to optimize for ciphertext stealing given
that it's rarely needed in practice, the other two benefits mentioned
above make this optimization worthwhile.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
7daba20cc7 crypto: x86/sha256-ni - simplify do_4rounds
Instead of loading the message words into both MSG and \m0 and then
adding the round constants to MSG, load the message words into \m0 and
the round constants into MSG and then add \m0 to MSG.  This shortens the
source code slightly.  It changes the instructions slightly, but it
doesn't affect binary code size and doesn't seem to affect performance.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
59e62b20ac crypto: x86/sha256-ni - optimize code size
- Load the SHA-256 round constants relative to a pointer that points
  into the middle of the constants rather than to the beginning.  Since
  x86 instructions use signed offsets, this decreases the instruction
  length required to access some of the later round constants.

- Use punpcklqdq or punpckhqdq instead of longer instructions such as
  pshufd, pblendw, and palignr.  This doesn't harm performance.

The end result is that sha256_ni_transform shrinks from 839 bytes to 791
bytes, with no loss in performance.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
1b5ddb067d crypto: x86/sha256-ni - rename some register aliases
MSGTMP[0-3] are used to hold the message schedule and are not temporary
registers per se.  MSGTMP4 is used as a temporary register for several
different purposes and isn't really related to MSGTMP[0-3].  Rename them
to MSG[0-3] and TMP accordingly.

Also add a comment that clarifies what MSG is.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
ffaec34b0f crypto: x86/sha256-ni - convert to use rounds macros
To avoid source code duplication, do the SHA-256 rounds using macros.
This reduces the length of sha256_ni_asm.S by 153 lines while still
producing the exact same object file.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
b924ecd305 crypto: x86/aes-xts - access round keys using single-byte offsets
Access the AES round keys using offsets -7*16 through 7*16, instead of
0*16 through 14*16.  This allows VEX-encoded instructions to address all
round keys using 1-byte offsets, whereas before some needed 4-byte
offsets.  This decreases the code size of aes-xts-avx-x86_64.o by 4.2%.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
751fb2528c crypto: x86/aes-xts - make non-AVX implementation use new glue code
Make the non-AVX implementation of AES-XTS (xts-aes-aesni) use the new
glue code that was introduced for the AVX implementations of AES-XTS.
This reduces code size, and it improves the performance of xts-aes-aesni
due to the optimization for messages that don't span page boundaries.

This required moving the new glue functions higher up in the file and
allowing the IV encryption function to be specified by the caller.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12 15:07:53 +08:00
Eric Biggers
6a24fdfe1e crypto: x86/sha512-avx2 - add missing vzeroupper
Since sha512_transform_rorx() uses ymm registers, execute vzeroupper
before returning from it.  This is necessary to avoid reducing the
performance of SSE code.
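
The C-intrinsics analogue, purely for illustration (compilers insert
vzeroupper automatically; hand-written assembly has to do it explicitly):

    #include <immintrin.h>

    void avx2_routine(float *dst, const float *a, const float *b)
    {
            __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
            _mm256_storeu_ps(dst, v);
            _mm256_zeroupper();     /* vzeroupper: clear the ymm upper halves
                                     * so later SSE code avoids stalls */
    }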

Fixes: e01d69cb01 ("crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12 15:07:52 +08:00
Eric Biggers
57ce8a4e16 crypto: x86/sha256-avx2 - add missing vzeroupper
Since sha256_transform_rorx() uses ymm registers, execute vzeroupper
before returning from it.  This is necessary to avoid reducing the
performance of SSE code.

Fixes: d34a460092 ("crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12 15:07:52 +08:00
Eric Biggers
4ad096cca9 crypto: x86/nh-avx2 - add missing vzeroupper
Since nh_avx2() uses ymm registers, execute vzeroupper before returning
from it.  This is necessary to avoid reducing the performance of SSE
code.

Fixes: 0f961f9f67 ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12 15:07:52 +08:00
Eric Biggers
aa2197f566 crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions.  This implementation uses zmm registers to operate on four
AES blocks at a time.  The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.

To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models.  They will use xts-aes-vaes-avx10_256 instead.  For
now, this exclusion list is simply coded into aesni-intel_glue.c.  It
may make sense to eventually move it into a more central location.

xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs.  E.g., on AMD Zen 4, AES-256-XTS decryption
throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte
inputs.  On Intel Sapphire Rapids, AES-256-XTS decryption throughput
increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs.

Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05 15:46:33 +08:00
Eric Biggers
ee63fea005 crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL
extensions.  This implementation avoids using zmm registers, instead
using ymm registers to operate on two AES blocks at a time.  The
assembly code is instantiated using a macro so that most of the source
code is shared with other implementations.

This is the optimal implementation on CPUs that support VAES and AVX512
but where the zmm registers should not be used due to downclocking
effects, for example Intel's Ice Lake.  It should also be the optimal
implementation on future CPUs that support AVX10/256 but not AVX10/512.

The performance is slightly better than that of xts-aes-vaes-avx2, which
uses the same 256-bit vector length, due to factors such as being able
to use ymm16-ymm31 to cache the AES round keys, and being able to use
the vpternlogd instruction to do XORs more efficiently.  For example, on
Ice Lake, the throughput of decrypting 4096-byte messages with
AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with
xts-aes-vaes-avx2.  While this is a small improvement, it is
straightforward to provide this implementation (xts-aes-vaes-avx10_256)
as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512
anyway, due to the way the _aes_xts_crypt macro is structured.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05 15:46:33 +08:00
Eric Biggers
e787060bdf crypto: x86/aes-xts - wire up VAES + AVX2 implementation
Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with
the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10.
This implementation uses ymm registers to operate on two AES blocks at a
time.  The assembly code is instantiated using a macro so that most of
the source code is shared with other implementations.

This is the optimal implementation on AMD Zen 3.  It should also be the
optimal implementation on Intel Alder Lake, which similarly supports
VAES but not AVX512.  Comparing to xts-aes-aesni-avx on Zen 3,
xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput
with 4096-byte messages, or 23% higher with 512-byte messages.

A large improvement is also seen with CPUs that do support AVX512 (e.g.,
98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte
messages), though the following patches add AVX512 optimized
implementations to get a bit more performance on those CPUs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05 15:46:33 +08:00
Eric Biggers
996f4dcbd2 crypto: x86/aes-xts - wire up AESNI + AVX implementation
Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that
have the AES-NI and AVX extensions but not VAES.  It's similar to the
existing xts-aes-aesni in that uses xmm registers to operate on one AES
block at a time.  It differs from xts-aes-aesni in the following ways:

- It uses the VEX-coded (non-destructive) instructions from AVX.
  This improves performance slightly.
- It incorporates some additional optimizations such as interleaving the
  tweak computation with AES en/decryption, handling single-page
  messages more efficiently, and caching the first round key.
- It supports only 64-bit (x86_64).
- It's generated by an assembly macro that will also be used to generate
  VAES-based implementations.

The performance improvement over xts-aes-aesni varies from small to
large, depending on the CPU and other factors such as the size of the
messages en/decrypted.  For example, the following increases in
AES-256-XTS decryption throughput are seen on the following CPUs:

                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        6%          |       31%         |
    Intel Cascade Lake    |        4%          |       26%         |
    AMD Zen 1             |        61%         |       73%         |
    AMD Zen 2             |        36%         |       59%         |

(The above CPUs don't support VAES, so they can't use VAES instead.)

While this isn't as large an improvement as what VAES provides, this
still seems worthwhile.  This implementation is fairly easy to provide
based on the assembly macro that's needed for VAES anyway, and it will
be the best implementation on a large number of CPUs (very roughly, the
CPUs launched by Intel and AMD from 2011 to 2018).

This makes the existing xts-aes-aesni *mostly* obsolete.  For now, leave
it in place to support 32-bit kernels and also CPUs like Intel Westmere
that support AES-NI but not AVX.  (We could potentially remove it anyway
and just rely on the indirect acceleration via ecb-aes-aesni in those
cases, but that change will need to be considered separately.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05 15:46:33 +08:00
Eric Biggers
d637168810 crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.

This patch doesn't expand the macro at all.  Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.

The file also provides a function aes_xts_encrypt_iv() which handles the
encryption of the IV (tweak), using AES-NI and AVX.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05 15:46:33 +08:00
Chang S. Bae
e3299a4c1c crypto: x86/aesni - Update aesni_set_key() to return void
The aesni_set_key() implementation has no error case, yet its prototype
declares an error-code return value.

Modify the function prototype to return void and adjust the related code.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-02 10:49:39 +08:00
Chang S. Bae
d50b35f0c4 crypto: x86/aesni - Rearrange AES key size check
aes_expandkey() already includes an AES key size check, so when AES-NI is
unusable, invoke it without a separate size check.

Also, use aes_check_keylen() instead of open code.
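
Roughly, as a sketch rather than the actual aesni glue code:

    #include <linux/types.h>
    #include <crypto/aes.h>         /* aes_check_keylen(), aes_expandkey() */

    static int set_key_sketch(struct crypto_aes_ctx *ctx, const u8 *key,
                              unsigned int keylen, bool have_aesni)
    {
            int err;

            if (!have_aesni)
                    return aes_expandkey(ctx, key, keylen); /* checks keylen itself */

            err = aes_check_keylen(keylen); /* instead of an open-coded
                                             * 16/24/32-byte check */
            if (err)
                    return err;
            /* ... call the AES-NI setkey routine ... */
            return 0;
    }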

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-02 10:49:39 +08:00