From d52d5ad147841fb2821b0dcbe3d5b205cf510a99 Mon Sep 17 00:00:00 2001
From: Andy Polyakov
Date: Sun, 5 Sep 2010 19:52:14 +0000
Subject: [PATCH] modes/asm/ghash-*.pl: switch to [more reproducible] performance results collected with 'apps/openssl speed ghash'.

---
 crypto/modes/asm/ghash-parisc.pl  |  6 +--
 crypto/modes/asm/ghash-sparcv9.pl |  4 +-
 crypto/modes/asm/ghash-x86.pl     | 69 +++++++++++++++----------------
 crypto/modes/asm/ghash-x86_64.pl  |  9 ++--
 4 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/crypto/modes/asm/ghash-parisc.pl b/crypto/modes/asm/ghash-parisc.pl
index 8849f01bff..8c7454ee93 100644
--- a/crypto/modes/asm/ghash-parisc.pl
+++ b/crypto/modes/asm/ghash-parisc.pl
@@ -12,9 +12,9 @@
 # The module implements "4-bit" GCM GHASH function and underlying
 # single multiplication operation in GF(2^128). "4-bit" means that it
 # uses 256 bytes per-key table [+128 bytes shared table]. On PA-7100LC
-# it processes one byte in 19 cycles, which is more than twice as fast
-# as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for 8
-# cycles, but measured performance on PA-8600 system is ~9 cycles per
+# it processes one byte in 19.6 cycles, which is more than twice as
+# fast as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for
+# 8 cycles, but measured performance on PA-8600 system is ~9 cycles per
 # processed byte. This is ~2.2x faster than 64-bit code generated by
 # vendor compiler (which used to be very hard to beat:-).
 #
diff --git a/crypto/modes/asm/ghash-sparcv9.pl b/crypto/modes/asm/ghash-sparcv9.pl
index 367e1b69da..70e7b044a3 100644
--- a/crypto/modes/asm/ghash-sparcv9.pl
+++ b/crypto/modes/asm/ghash-sparcv9.pl
@@ -17,8 +17,8 @@
 #
 #               gcc 3.3.x       cc 5.2          this assembler
 #
-# 32-bit build  81.0            48.6            11.8    (+586%/+311%)
-# 64-bit build  27.5            20.3            11.8    (+133%/+72%)
+# 32-bit build  81.4            43.3            12.6    (+546%/+244%)
+# 64-bit build  20.2            21.2            12.6    (+60%/+68%)
 #
 # Here is data collected on UltraSPARC T1 system running Linux:
 #
diff --git a/crypto/modes/asm/ghash-x86.pl b/crypto/modes/asm/ghash-x86.pl
index a768a056f3..fca19e41f0 100644
--- a/crypto/modes/asm/ghash-x86.pl
+++ b/crypto/modes/asm/ghash-x86.pl
@@ -21,17 +21,18 @@
 #
 #               gcc 2.95.3(*)   MMX assembler   x86 assembler
 #
-# Pentium       100/112(**)     -               50
-# PIII          63 /77          12.2            24
-# P4            96 /122         18.0            84(***)
-# Opteron       50 /71          10.1            30
-# Core2         54 /68          8.6             18
+# Pentium       105/111(**)     -               50
+# PIII          68 /75          12.2            24
+# P4            125/125         17.8            84(***)
+# Opteron       66 /70          10.1            30
+# Core2         54 /67          8.4             18
 #
 # (*)   gcc 3.4.x was observed to generate few percent slower code,
 #       which is one of reasons why 2.95.3 results were chosen,
 #       another reason is lack of 3.4.x results for older CPUs;
-#       comparison is not completely fair, because C results are
-#       for vanilla "256B" implementations, not "528B";-)
+#       comparison with MMX results is not completely fair, because C
+#       results are for vanilla "256B" implementation, while
+#       assembler results are for "528B";-)
 # (**)  second number is result for code compiled with -fPIC flag,
 #       which is actually more relevant, because assembler code is
 #       position-independent;
@@ -44,7 +45,7 @@
 
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.13 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
 # The question is how close is it to theoretical limit? The pclmulqdq
 # instruction latency appears to be 14 cycles and there can't be more
 # than 2 of them executing at any given time. This means that single
@@ -60,38 +61,36 @@
 # Before we proceed to this implementation let's have closer look at
 # the best-performing code suggested by Intel in their white paper.
 # By tracing inter-register dependencies Tmod is estimated as ~19
-# cycles and Naggr is 4, resulting in 2.05 cycles per processed byte.
-# As implied, this is quite optimistic estimate, because it does not
-# account for Karatsuba pre- and post-processing, which for a single
-# multiplication is ~5 cycles. Unfortunately Intel does not provide
-# performance data for GHASH alone, only for fused GCM mode. But
-# we can estimate it by subtracting CTR performance result provided
-# in "AES Instruction Set" white paper: 3.54-1.38=2.16 cycles per
-# processed byte or 5% off the estimate. It should be noted though
-# that 3.54 is GCM result for 16KB block size, while 1.38 is CTR for
-# 1KB block size, meaning that real number is likely to be a bit
-# further from estimate.
+# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
+# processed byte. As implied, this is quite optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte of out 16KB buffer. Note that
+# the result accounts even for pre-computing of degrees of the hash
+# key H, but its portion is negligible at 16KB buffer size.
 #
 # Moving on to the implementation in question. Tmod is estimated as
 # ~13 cycles and Naggr is 2, giving asymptotic performance of ...
 # 2.16. How is it possible that measured performance is better than
 # optimistic theoretical estimate? There is one thing Intel failed
-# to recognize. By fusing GHASH with CTR former's performance is
-# really limited to above (Tmul + Tmod/Naggr) equation. But if GHASH
-# procedure is detached, the modulo-reduction can be interleaved with
-# Naggr-1 multiplications and under ideal conditions even disappear
-# from the equation. So that optimistic theoretical estimate for this
-# implementation is ... 28/16=1.75, and not 2.16. Well, it's probably
-# way too optimistic, at least for such small Naggr. I'd argue that
-# (28+Tproc/Naggr), where Tproc is time required for Karatsuba pre-
-# and post-processing, is more realistic estimate. In this case it
-# gives ... 1.91 cycles per processed byte. Or in other words,
-# depending on how well we can interleave reduction and one of the
-# two multiplications the performance should be betwen 1.91 and 2.16.
-# As already mentioned, this implementation processes one byte [out
-# of 1KB buffer] in 2.13 cycles, while x86_64 counterpart - in 2.07.
-# x86_64 performance is better, because larger register bank allows
-# to interleave reduction and multiplication better.
+# to recognize. By serializing GHASH with CTR in same subroutine
+# former's performance is really limited to above (Tmul + Tmod/Naggr)
+# equation. But if GHASH procedure is detached, the modulo-reduction
+# can be interleaved with Naggr-1 multiplications at instruction level
+# and under ideal conditions even disappear from the equation. So that
+# optimistic theoretical estimate for this implementation is ...
+# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
+# at least for such small Naggr. I'd argue that (28+Tproc/Naggr),
+# where Tproc is time required for Karatsuba pre- and post-processing,
+# is more realistic estimate. In this case it gives ... 1.91 cycles.
+# Or in other words, depending on how well we can interleave reduction
+# and one of the two multiplications the performance should be betwen
+# 1.91 and 2.16. As already mentioned, this implementation processes
+# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
+# - in 2.02. x86_64 performance is better, because larger register
+# bank allows to interleave reduction and multiplication better.
 #
 # Does it make sense to increase Naggr? To start with it's virtually
 # impossible in 32-bit mode, because of limited register bank
diff --git a/crypto/modes/asm/ghash-x86_64.pl b/crypto/modes/asm/ghash-x86_64.pl
index b80be6c742..34edb397eb 100644
--- a/crypto/modes/asm/ghash-x86_64.pl
+++ b/crypto/modes/asm/ghash-x86_64.pl
@@ -20,17 +20,18 @@
 #               gcc 3.4.x(*)    assembler
 #
 # P4            28.6            14.0            +100%
-# Opteron       18.5            7.7             +140%
-# Core2         17.5            8.1(**)         +115%
+# Opteron       19.3            7.7             +150%
+# Core2         17.8            8.1(**)         +120%
 #
 # (*)   comparison is not completely fair, because C results are
-#       for vanilla "256B" implementation, not "528B";-)
+#       for vanilla "256B" implementation, while assembler results
+#       are for "528B";-)
 # (**)  it's mystery [to me] why Core2 result is not same as for
 #       Opteron;
 
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.07 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.02 cycles per processed byte.
 # See ghash-x86.pl for background information and details about coding
 # techniques.
 #
-- 
2.40.0
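
Note on the estimates quoted in the ghash-x86.pl comment: the figures 2.05, 2.16, 1.75 and 1.91 all follow from the (Tmul + Tmod/Naggr) form mentioned there, divided by the 16 bytes of one GHASH block. The Python sketch below only re-derives that arithmetic and is not part of the patch. It rests on assumptions: Tmul = 28 cycles is inferred from the "28/16=1.75" expression, Tproc = 5 is taken from the "~5 cycles" remark, and the function and constant names are invented for illustration.

```python
BLOCK_BYTES = 16  # one GHASH multiplication consumes a 128-bit block

# Assumption: 28 cycles per multiplication, inferred from "28/16=1.75".
T_MUL = 28


def cycles_per_byte(t_mul, t_amortized, n_aggr):
    """Per-byte estimate: (t_mul + t_amortized / n_aggr) / 16.

    t_amortized is whatever cost is paid once per n_aggr aggregated
    multiplications (Tmod or Tproc in the comment's terms).
    """
    return (t_mul + t_amortized / n_aggr) / BLOCK_BYTES


estimates = {
    # Intel's code as described in the comment: Tmod ~19, Naggr 4
    "Intel, Tmod=19, Naggr=4": cycles_per_byte(T_MUL, 19, 4),
    # This implementation, reduction not hidden: Tmod ~13, Naggr 2
    "this code, Tmod=13, Naggr=2": cycles_per_byte(T_MUL, 13, 2),
    # Reduction fully interleaved, so it drops out of the equation
    "reduction hidden": cycles_per_byte(T_MUL, 0, 2),
    # Reduction hidden, Karatsuba pre/post-processing (~5) still paid
    "reduction hidden, Tproc=5": cycles_per_byte(T_MUL, 5, 2),
}

if __name__ == "__main__":
    for label, value in estimates.items():
        print(f"{label}: {value:.2f} cycles/byte")
```

Under these assumptions the measured 2.10 (x86) and 2.02 (x86_64) cycles per byte fall inside the 1.91 to 2.16 window the comment predicts.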