commit e6e758fa64
Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by
about 70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to
  one of three underlying implementations according to the CPU
  capabilities detected at runtime: AES-NI, AES-NI + AVX, or
  AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% of
  the size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless
  multiplication whereas the AVX2 code did not.  The "AVX2" code did
  not actually use AVX2; the check for AVX2 was really a check for
  Intel Haswell or later to detect support for fast carryless
  multiplication.  The long source length was caused by factors such as
  significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless
  multiplication by using Karatsuba multiplication and a
  multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is
  made feasible by just using the non-VEX coded form of most
  instructions.

- Use a unified approach where the main loop does 8 blocks per
  iteration and uses Karatsuba multiplication to save one pclmulqdq per
  block, but does not use the multiplication-less reduction.  This
  strikes a good balance across the range of CPUs on which this code
  runs.  (A short C illustration of the Karatsuba trick follows this
  message.)

- Don't spam the kernel log with an informational message on every
  boot.

The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that
indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking
dm-crypt I/O) include more overhead and won't see quite as much of a
difference.

All these benchmarks used an associated data length of 16 bytes.  Note
that AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
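As an aside on the Karatsuba trick referenced above: a 128x128-bit
carryless multiplication built from 64-bit halves normally costs four
64x64 carryless products, but Karatsuba recovers the middle term from
the two outer products plus one product of XORed halves, for three
total.  The following is a minimal, illustrative C sketch; the software
clmul64() helper and the test values are assumptions for demonstration
only, not the kernel's pclmulqdq-based assembly:

#include <stdint.h>
#include <stdio.h>

/* 128-bit result of a 64x64 carryless (GF(2)) multiplication */
struct u128 { uint64_t lo, hi; };

/* Software carryless multiply: what one pclmulqdq instruction does. */
static struct u128 clmul64(uint64_t a, uint64_t b)
{
	struct u128 r = { 0, 0 };

	for (int i = 0; i < 64; i++) {
		if ((b >> i) & 1) {
			r.lo ^= a << i;
			if (i)
				r.hi ^= a >> (64 - i);
		}
	}
	return r;
}

/* Schoolbook 128x128 -> 256-bit carryless multiply: 4 clmuls. */
static void clmul128_schoolbook(const uint64_t a[2], const uint64_t b[2],
				uint64_t out[4])
{
	struct u128 lo = clmul64(a[0], b[0]);
	struct u128 hi = clmul64(a[1], b[1]);
	struct u128 m1 = clmul64(a[1], b[0]);
	struct u128 m2 = clmul64(a[0], b[1]);

	out[0] = lo.lo;
	out[1] = lo.hi ^ m1.lo ^ m2.lo;
	out[2] = hi.lo ^ m1.hi ^ m2.hi;
	out[3] = hi.hi;
}

/* Karatsuba: 3 clmuls; the middle term comes from the outer two. */
static void clmul128_karatsuba(const uint64_t a[2], const uint64_t b[2],
			       uint64_t out[4])
{
	struct u128 lo = clmul64(a[0], b[0]);
	struct u128 hi = clmul64(a[1], b[1]);
	struct u128 mid = clmul64(a[0] ^ a[1], b[0] ^ b[1]);

	/* (a0+a1)(b0+b1) = a0b0 + a1b1 + (a0b1 + a1b0) over GF(2) */
	mid.lo ^= lo.lo ^ hi.lo;
	mid.hi ^= lo.hi ^ hi.hi;

	out[0] = lo.lo;
	out[1] = lo.hi ^ mid.lo;
	out[2] = hi.lo ^ mid.hi;
	out[3] = hi.hi;
}

int main(void)
{
	/* Arbitrary test inputs, chosen only for this demonstration. */
	const uint64_t a[2] = { 0x0123456789abcdef, 0xfedcba9876543210 };
	const uint64_t b[2] = { 0xdeadbeefcafef00d, 0x0f1e2d3c4b5a6978 };
	uint64_t s[4], k[4];

	clmul128_schoolbook(a, b, s);
	clmul128_karatsuba(a, b, k);
	for (int i = 0; i < 4; i++)
		printf("%016llx %016llx %s\n",
		       (unsigned long long)s[i], (unsigned long long)k[i],
		       s[i] == k[i] ? "ok" : "MISMATCH");
	return 0;
}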
Makefile
# SPDX-License-Identifier: GPL-2.0
#
# x86 crypto algorithms

obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o
twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o

obj-$(CONFIG_CRYPTO_SERPENT_SSE2_586) += serpent-sse2-i586.o
serpent-sse2-i586-y := serpent-sse2-i586-asm_32.o serpent_sse2_glue.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o
serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o
obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o

obj-$(CONFIG_CRYPTO_DES3_EDE_X86_64) += des3_ede-x86_64.o
des3_ede-x86_64-y := des3_ede-asm_64.o des3_ede_glue.o

obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += camellia-aesni-avx-x86_64.o
camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o camellia_aesni_avx_glue.o
obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64) += camellia-aesni-avx2.o
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o

obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o

obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o
cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o

obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o
cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o

obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o
aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o

obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o

obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
aesni-intel-$(CONFIG_64BIT) += aes_ctrby8_avx-x86_64.o \
	aes-gcm-aesni-x86_64.o \
	aes-xts-avx-x86_64.o
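# CONFIG_AS_VAES and CONFIG_AS_VPCLMULQDQ are bool symbols, so
# concatenating them and comparing against "yy" enables this object
# only when the assembler supports both the VAES and VPCLMULQDQ
# instruction extensions.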
ifeq ($(CONFIG_AS_VAES)$(CONFIG_AS_VPCLMULQDQ),yy)
aesni-intel-$(CONFIG_64BIT) += aes-gcm-avx10-x86_64.o
endif

obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o

obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o
sha256-ssse3-$(CONFIG_AS_SHA256_NI) += sha256_ni_asm.o

obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o

obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += libblake2s-x86_64.o
libblake2s-x86_64-y := blake2s-core.o blake2s-glue.o

obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o

obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI) += polyval-clmulni.o
polyval-clmulni-y := polyval-clmulni_asm.o polyval-clmulni_glue.o

obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
crc32c-intel-y := crc32c-intel_glue.o
crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o

obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o

obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o

obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o
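# poly1305-x86_64-cryptogams.S is generated at build time by the
# perlasm rule below; listing it in $(targets) lets kbuild track and
# clean the generated file.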
targets += poly1305-x86_64-cryptogams.S

obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o
nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o

obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o

obj-$(CONFIG_CRYPTO_SM3_AVX_X86_64) += sm3-avx-x86_64.o
sm3-avx-x86_64-y := sm3-avx-asm_64.o sm3_avx_glue.o

obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64) += sm4-aesni-avx-x86_64.o
sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o

obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64) += sm4-aesni-avx2-x86_64.o
sm4-aesni-avx2-x86_64-y := sm4-aesni-avx2-asm_64.o sm4_aesni_avx2_glue.o

obj-$(CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64) += aria-aesni-avx-x86_64.o
aria-aesni-avx-x86_64-y := aria-aesni-avx-asm_64.o aria_aesni_avx_glue.o

obj-$(CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64) += aria-aesni-avx2-x86_64.o
aria-aesni-avx2-x86_64-y := aria-aesni-avx2-asm_64.o aria_aesni_avx2_glue.o

obj-$(CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64) += aria-gfni-avx512-x86_64.o
aria-gfni-avx512-x86_64-y := aria-gfni-avx512-asm_64.o aria_gfni_avx512_glue.o

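# Generate an .S file from each CRYPTOGAMS-style .pl script, e.g.
# poly1305-x86_64-cryptogams.S from its corresponding Perl source.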
quiet_cmd_perlasm = PERLASM $@
      cmd_perlasm = $(PERL) $< > $@
$(obj)/%.S: $(src)/%.pl FORCE
	$(call if_changed,perlasm)

# Disable GCOV in odd or sensitive code
GCOV_PROFILE_curve25519-x86_64.o := n