Conference paper Open Access
Cheng, H.; Großschädl, J.; Tian, J.; Roenne, P.; Ryan, P.
Single-Instruction-Multiple-Data (SIMD) extensions like Intel's AVX2 offer a great potential to accelerate elliptic curve cryptography compared to a straightforward implementation using only base x64 instructions. All existing AVX2 implementations of scalar multiplication on Curve25519 and alternative elliptic curves are optimized for low latency. We argue in this paper that many applications, most notably server-side TLS handshake processing, would benefit more from throughput-optimized implementations than latency-optimized ones. To support this argument we introduce throughput-optimized AVX2 implementations of variable-base scalar multiplication on Curve25519 and fixed-base scalar multiplication on Ed25519. Both implementations perform four scalar multiplications in parallel, whereby each scalar multiplication uses a 64-bit element of a 256-bit AVX2 vector. The field arithmetic is based on a radix-2^29 representation of the field elements, which makes it possible to execute four parallel multiplications modulo a multiple of p = 2^255 - 19 in just 88 Skylake cycles. Four variable-base scalar multiplications on Curve25519 require less than 250,000 Skylake cycles, which translates into a throughput of 32,318 scalar multiplications per second at a clock frequency of 2 GHz. For comparison, the currently best latency-optimized AVX2 implementation reaches a throughput of only about 21,000 scalar multiplications per second on the same Skylake processor.
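To make the radix-2^29 idea more concrete, the following Python sketch models one possible reading of it. It is not the paper's AVX2 code. It assumes nine 29-bit limbs per field element (9 x 29 = 261 bits), so that wrapping at 2^261 corresponds to reduction modulo 64p = 2^261 - 1216, which is one multiple of p. All names (`to_limbs`, `mul_limbs`, `mul_4way`, `FOLD`) are hypothetical, and the four "lanes" are simulated with a plain list rather than a 256-bit vector.

```python
# Illustrative model only: Python integers are unbounded, so the overflow
# management that real 64-bit AVX2 lanes require is not represented here.
P = 2**255 - 19
RADIX_BITS = 29
NUM_LIMBS = 9                      # assumption: 9 * 29 = 261 >= 255 bits
MASK = (1 << RADIX_BITS) - 1
FOLD = 64 * 19                     # 2^261 = 64 * 2^255 ≡ 64 * 19 (mod p)


def to_limbs(x):
    """Split an integer below 2^261 into nine 29-bit limbs (little-endian)."""
    return [(x >> (RADIX_BITS * i)) & MASK for i in range(NUM_LIMBS)]


def from_limbs(limbs):
    """Recombine limbs into a single integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))


def carry_reduce(cols):
    """Propagate carries; wrap any carry out of the top limb via FOLD."""
    limbs = list(cols)
    while True:
        carry = 0
        for i in range(NUM_LIMBS):
            t = limbs[i] + carry
            limbs[i] = t & MASK
            carry = t >> RADIX_BITS
        if carry == 0:
            return limbs
        limbs[0] += FOLD * carry   # carry has weight 2^261 ≡ FOLD


def mul_limbs(a, b):
    """Schoolbook limb product, reduced modulo 64p = 2^261 - 1216."""
    cols = [0] * (2 * NUM_LIMBS - 1)
    for i in range(NUM_LIMBS):
        for j in range(NUM_LIMBS):
            cols[i + j] += a[i] * b[j]
    # Fold the upper columns (weight >= 2^261) back onto the lower ones.
    for k in range(NUM_LIMBS, 2 * NUM_LIMBS - 1):
        cols[k - NUM_LIMBS] += FOLD * cols[k]
    return carry_reduce(cols[:NUM_LIMBS])


def mul_4way(a_lanes, b_lanes):
    """Four independent multiplications, one per simulated vector lane."""
    return [mul_limbs(a, b) for a, b in zip(a_lanes, b_lanes)]
```

In this model the result of `mul_limbs` is congruent to the true product modulo p but is not necessarily fully reduced below p, which is consistent with working modulo a multiple of p and deferring the final reduction. In a vectorized implementation, each limb of the four operands would presumably occupy one 64-bit element of a 256-bit vector, so that one vector instruction advances all four multiplications at once.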