# Optimizing Keccak
###### tags: `keccak`, `eth1.x`, `c`
## Introduction
It all started by Turbo-Geth showing me the CPU profile where Keccak hashing in turbogeth had the top position.
The [ethash](https://github.com/chfast/ethash) has simple C Keccak implementation.
The "permute" function `keccakf1600` (aka Keccak-_f_[1600], Keccak-_p_[1600,24])
is some "simple" implementation which looks similar to "generic-64" one from [XKCP](https://github.com/XKCP/XKCP). Back when I was working on this, this implementation was performing the best compared to 2-3 other non-assembly implementations.
At some point we were also checking the ethash performance against geth. After some changes in geth the performance matched. It was also noticeable that geth has faster Keccak implementation.
Péter had written it in Go-ASM porting from XKCP. I'm guessing from [KeccakP-1600-x86-64-gas.s](https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/x86-64/KeccakP-1600-x86-64-gas.s).
## AVX2 assembly
The AVX2 implementation looked very promising compared to pure x86_64 one. Originally written by in [CRYPROGAMS](https://github.com/dot-asm/cryptogams/blob/master/x86_64/keccak1600-avx2.pl). Adopted by [OpenSSL](https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-avx2.pl) and [XKCP](https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/AVX2/KeccakP-1600-AVX2.s). However, from the information available in the description of the implementation it might perform worse than pure x86_64 assembly on some AMD CPUs with poor AVX2 implementations.
This was indeed confirmed by benchmarks.
I added XKCP AVX2 assembly implementation of the Keccak-f[1600] function to the ethash::keccak library for benchmarking. It compiles with GCC/Clang on Linux and with some issues on macOS. It probably requires different Keccak state layout because it produces _incompatible results_. More work is needed to bring it to properly working state.
Conclusions from benchmarks:
- For some Intel CPUs the AVX2 implementation is **10–15% faster** than the second fastest one,
- For some Intel CPUs the AVX2 implementation has comparable performance to the second fastest one,
- For an AMD CPU (EPYC 7601 32-Core) the AVX2 implementation is **2x slower** than the fastest one.
## x86_64 assembly
I also tried a x86_64 assembly implementation of the Keccak-f[1600] from [XKCP/KeccakP-1600-x86-64-gas.s](https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/x86-64/KeccakP-1600-x86-64-gas.s) Similarly to the AVX2 implementation this produces _incompatible results_. This compiles fine on Linux, but not on macOS but there is the `_Apple.s` variant file available which I have not touched.
This implementation unrolls all 24 rounds of the Keccak. This looks very wasteful to CPU instruction cache.
From benchmarks this have the same performance as OpenSSL's C implementation so using assembly is not beneficial.
For the record I also tried "generic-64" C implementation from [XKCP/KeccakP-1600-opt64.c](https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/plain-64bits/KeccakP-1600-opt64.c). It also unrolls all 24 rounds by default. From a set of initial benchmarks this was performing slightly worse than the original ethash::keccak implementation so I dropped it at that point.
## OpenSSL C implementation
TODO
Check out the OpenSSL SHA3 C implementation. Looks nicely selfcontained: https://github.com/openssl/openssl/blob/master/crypto/sha/keccak1600.c
## Benchmarking Results
### Collecting benchmarking data
_This may not be reproducible as the code is being modified._
#### Build ethash
If possible build with latest GCC (10) or Clang (11) by setting `CC` and `CXX` variables, e.g. `CC=clang-11 CXX=clang++-11`, in script below.
```bash
git clone https://github.com/chfast/ethash --single-branch --no-tags -b keccak_experiments
cd ethash
CC=cc CXX=c++ cmake -S . -B build
cmake --build build --target ethash-bench
```
> On macOS likely this will fail. To fix:
> 1. add `macOS:` after `.text` in `lib/keccak/KeccakP-1600-AVX2.s`.
> 2. remove `lib/keccak/KeccakP-1600-x86-64-gas.s`, remove it from `lib/keccak/CMakeLists.txt`, and remove all references to `keccakf1600_asm` from `test/benchmarks/keccak_benchmarks.cpp`.
#### Run Keccak-_f_[1600] benchmarks
```bash
build/test/ethash-bench --benchmark_filter=keccakf1600 --benchmark_repetitions=20 --benchmark_report_aggregates_only
```
Report the results below with the CPU and compiler used.
### Initial benchmarks: ethash::keccak vs XKCP
<details><summary>Details</summary>
<p>
- `keccakf1600`: ethash::keccak
- `keccakf1600_new`: XKCP generic-64
- `keccakf1600_avx2` XKCP AVX2 assembly
- `keccakf1600_asm` XKCP x86_64 assembly
```
2020-11-26 12:40:06
Running test/ethash-bench
Run on Haswell 4.4 GHz
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 8192K (x1)
GCC 10
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
keccakf1600 376 ns 376 ns 1862281
keccakf1600_new 389 ns 389 ns 1798183
keccakf1600_avx2 283 ns 283 ns 2474567
keccakf1600_asm 333 ns 333 ns 2101909
Clang 11
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
keccakf1600 376 ns 376 ns 1862638
keccakf1600_new 383 ns 383 ns 1825692
keccakf1600_avx2 264 ns 264 ns 2651116
keccakf1600_asm 331 ns 331 ns 2111962
```
```
2020-11-26 12:46:57
Running test/ethash-bench
Run on Whiskey Lake 4.6 GHz
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 8192K (x1)
GCC 10
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
keccakf1600 356 ns 356 ns 1974381
keccakf1600_new 369 ns 369 ns 1892730
keccakf1600_avx2 238 ns 238 ns 2994248
keccakf1600_asm 341 ns 341 ns 2125399
Clang 11
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
keccakf1600 372 ns 372 ns 1906351
keccakf1600_new 370 ns 370 ns 1889851
keccakf1600_avx2 236 ns 236 ns 2949221
keccakf1600_asm 330 ns 330 ns 2147454
```
```
2020-11-26 12:02:02
Running test/ethash-bench
Run on EPYC 7601 2.2 GHz
CPU Caches:
L1 Data 64K (x4)
L1 Instruction 64K (x4)
L2 Unified 512K (x4)
L3 Unified 16384K (x4)
GCC 10
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
keccakf1600 552 ns 552 ns 1259894
keccakf1600_new 585 ns 584 ns 1202035
keccakf1600_avx2 918 ns 918 ns 765964
keccakf1600_asm 502 ns 502 ns 1409218
```
</p>
</details>
### Middle-stage benchmarks: ethash::keccak, XKCP, OpenSSL
<details><summary>Details</summary>
<p>
```
Paweł Bylica
Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
Clang 11
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 907 ns 907 ns 770842
keccakf1600_simple_median 904 ns 904 ns 770842
keccakf1600_simple_stddev 8 ns 8 ns 770842
keccakf1600_simple_bmi_mean 686 ns 686 ns 1017456
keccakf1600_simple_bmi_median 685 ns 685 ns 1017456
keccakf1600_simple_bmi_stddev 5 ns 5 ns 1017456
keccakf1600_simple_bmi2_mean 642 ns 642 ns 1089955
keccakf1600_simple_bmi2_median 640 ns 640 ns 1089955
keccakf1600_simple_bmi2_stddev 7 ns 7 ns 1089955
keccakf1600_simple_avx2_mean 641 ns 641 ns 1089743
keccakf1600_simple_avx2_median 640 ns 639 ns 1089743
keccakf1600_simple_avx2_stddev 6 ns 6 ns 1089743
keccakf1600_simple_haswell_mean 643 ns 643 ns 1052362
keccakf1600_simple_haswell_median 640 ns 640 ns 1052362
keccakf1600_simple_haswell_stddev 9 ns 9 ns 1052362
keccakf1600_new_mean 885 ns 885 ns 762584
keccakf1600_new_median 884 ns 884 ns 762584
keccakf1600_new_stddev 7 ns 7 ns 762584
keccakf1600_openssl_mean 767 ns 767 ns 863814
keccakf1600_openssl_median 764 ns 764 ns 863814
keccakf1600_openssl_stddev 8 ns 8 ns 863814
keccakf1600_avx2_mean 575 ns 575 ns 1183177
keccakf1600_avx2_median 574 ns 574 ns 1183177
keccakf1600_avx2_stddev 3 ns 3 ns 1183177
keccakf1600_asm_mean 765 ns 765 ns 911071
keccakf1600_asm_median 763 ns 763 ns 911071
keccakf1600_asm_stddev 6 ns 6 ns 911071
```
```
Paweł Bylica
AMD EPYC 7601 32-Core Processor @ 2.2 GHz
GCC 10
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 547 ns 547 ns 1282973
keccakf1600_simple_median 547 ns 547 ns 1282973
keccakf1600_simple_stddev 1 ns 1 ns 1282973
keccakf1600_simple_bmi_mean 443 ns 443 ns 1576725
keccakf1600_simple_bmi_median 443 ns 443 ns 1576725
keccakf1600_simple_bmi_stddev 0 ns 0 ns 1576725
keccakf1600_simple_bmi2_mean 452 ns 452 ns 1548466
keccakf1600_simple_bmi2_median 452 ns 452 ns 1548466
keccakf1600_simple_bmi2_stddev 1 ns 1 ns 1548466
keccakf1600_simple_avx2_mean 443 ns 442 ns 1579045
keccakf1600_simple_avx2_median 442 ns 442 ns 1579045
keccakf1600_simple_avx2_stddev 1 ns 1 ns 1579045
keccakf1600_simple_haswell_mean 443 ns 443 ns 1588185
keccakf1600_simple_haswell_median 443 ns 443 ns 1588185
keccakf1600_simple_haswell_stddev 1 ns 1 ns 1588185
keccakf1600_new_mean 578 ns 578 ns 1209559
keccakf1600_new_median 578 ns 578 ns 1209559
keccakf1600_new_stddev 0 ns 0 ns 1209559
keccakf1600_openssl_mean 490 ns 490 ns 1422030
keccakf1600_openssl_median 490 ns 490 ns 1422030
keccakf1600_openssl_stddev 1 ns 1 ns 1422030
keccakf1600_avx2_mean 905 ns 905 ns 773038
keccakf1600_avx2_median 905 ns 905 ns 773038
keccakf1600_avx2_stddev 1 ns 1 ns 773038
keccakf1600_asm_mean 494 ns 493 ns 1418425
keccakf1600_asm_median 493 ns 493 ns 1418425
keccakf1600_asm_stddev 1 ns 1 ns 1418425
```
```
axic
i5-6360U macOS
Run on (4 X 2000 MHz CPU s)
clang something
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 1009 ns 653 ns 1019650
keccakf1600_simple_median 997 ns 635 ns 1019650
keccakf1600_simple_stddev 248 ns 67 ns 1019650
keccakf1600_simple_bmi_mean 736 ns 475 ns 1475445
keccakf1600_simple_bmi_median 670 ns 461 ns 1475445
keccakf1600_simple_bmi_stddev 243 ns 46 ns 1475445
keccakf1600_simple_bmi2_mean 617 ns 451 ns 1675956
keccakf1600_simple_bmi2_median 553 ns 437 ns 1675956
keccakf1600_simple_bmi2_stddev 153 ns 35 ns 1675956
keccakf1600_simple_avx2_mean 612 ns 451 ns 1667993
keccakf1600_simple_avx2_median 559 ns 443 ns 1667993
keccakf1600_simple_avx2_stddev 213 ns 38 ns 1667993
keccakf1600_simple_haswell_mean 583 ns 457 ns 1037514
keccakf1600_simple_haswell_median 502 ns 431 ns 1037514
keccakf1600_simple_haswell_stddev 231 ns 62 ns 1037514
keccakf1600_new_mean 690 ns 597 ns 1207396
keccakf1600_new_median 643 ns 577 ns 1207396
keccakf1600_new_stddev 119 ns 39 ns 1207396
keccakf1600_openssl_mean 599 ns 502 ns 1367107
keccakf1600_openssl_median 567 ns 497 ns 1367107
keccakf1600_openssl_stddev 95 ns 24 ns 1367107
keccakf1600_avx2_mean 447 ns 360 ns 2022023
keccakf1600_avx2_median 412 ns 360 ns 2022023
keccakf1600_avx2_stddev 86 ns 10 ns 2022023
```
```
Andrew Ashikhmin
Intel(R) Core(TM) i7-8705G CPU @ 3.10GHz
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 449 ns 449 ns 1177318
keccakf1600_simple_median 440 ns 440 ns 1177318
keccakf1600_simple_stddev 24 ns 24 ns 1177318
keccakf1600_simple_bmi_mean 341 ns 339 ns 2055382
keccakf1600_simple_bmi_median 339 ns 338 ns 2055382
keccakf1600_simple_bmi_stddev 5 ns 4 ns 2055382
keccakf1600_simple_bmi2_mean 330 ns 328 ns 2153489
keccakf1600_simple_bmi2_median 327 ns 325 ns 2153489
keccakf1600_simple_bmi2_stddev 8 ns 8 ns 2153489
keccakf1600_simple_avx2_mean 337 ns 336 ns 2120534
keccakf1600_simple_avx2_median 332 ns 331 ns 2120534
keccakf1600_simple_avx2_stddev 13 ns 13 ns 2120534
keccakf1600_simple_haswell_mean 327 ns 326 ns 2076352
keccakf1600_simple_haswell_median 324 ns 323 ns 2076352
keccakf1600_simple_haswell_stddev 8 ns 8 ns 2076352
keccakf1600_new_mean 484 ns 484 ns 1548000
keccakf1600_new_median 485 ns 485 ns 1548000
keccakf1600_new_stddev 23 ns 22 ns 1548000
keccakf1600_openssl_mean 419 ns 419 ns 1646464
keccakf1600_openssl_median 419 ns 419 ns 1646464
keccakf1600_openssl_stddev 17 ns 17 ns 1646464
keccakf1600_avx2_mean 288 ns 288 ns 2441960
keccakf1600_avx2_median 287 ns 287 ns 2441960
keccakf1600_avx2_stddev 8 ns 8 ns 2441960
keccakf1600_asm_mean 389 ns 388 ns 1854061
keccakf1600_asm_median 379 ns 378 ns 1854061
keccakf1600_asm_stddev 19 ns 19 ns 1854061
```
```
Andrew Ashikhmin
Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Apple clang version 12.0.0 (clang-1200.0.32.21)
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 398 ns 398 ns 1656373
keccakf1600_simple_median 395 ns 395 ns 1656373
keccakf1600_simple_stddev 12 ns 12 ns 1656373
keccakf1600_simple_bmi_mean 336 ns 336 ns 2107856
keccakf1600_simple_bmi_median 334 ns 333 ns 2107856
keccakf1600_simple_bmi_stddev 5 ns 5 ns 2107856
keccakf1600_simple_bmi2_mean 324 ns 322 ns 2271002
keccakf1600_simple_bmi2_median 318 ns 316 ns 2271002
keccakf1600_simple_bmi2_stddev 21 ns 18 ns 2271002
keccakf1600_simple_avx2_mean 304 ns 303 ns 2129556
keccakf1600_simple_avx2_median 299 ns 299 ns 2129556
keccakf1600_simple_avx2_stddev 14 ns 13 ns 2129556
keccakf1600_simple_haswell_mean 292 ns 292 ns 2365496
keccakf1600_simple_haswell_median 292 ns 291 ns 2365496
keccakf1600_simple_haswell_stddev 5 ns 5 ns 2365496
keccakf1600_new_mean 471 ns 471 ns 1478228
keccakf1600_new_median 472 ns 471 ns 1478228
keccakf1600_new_stddev 6 ns 6 ns 1478228
keccakf1600_openssl_mean 385 ns 385 ns 1722466
keccakf1600_openssl_median 385 ns 385 ns 1722466
keccakf1600_openssl_stddev 6 ns 6 ns 1722466
keccakf1600_avx2_mean 320 ns 320 ns 2140915
keccakf1600_avx2_median 319 ns 319 ns 2140915
keccakf1600_avx2_stddev 6 ns 6 ns 2140915
```
```
2020-12-02 06:31:35
Intel(R) Core(TM) i3-5010U CPU @ 2.10GHz
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
clang version 10.0.0-4ubuntu1
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
keccakf1600_simple_mean 812 ns 812 ns 879096
keccakf1600_simple_median 803 ns 803 ns 879096
keccakf1600_simple_stddev 23 ns 23 ns 879096
keccakf1600_simple_bmi_mean 615 ns 615 ns 1120637
keccakf1600_simple_bmi_median 611 ns 611 ns 1120637
keccakf1600_simple_bmi_stddev 15 ns 15 ns 1120637
keccakf1600_simple_bmi2_mean 585 ns 585 ns 1181294
keccakf1600_simple_bmi2_median 585 ns 585 ns 1181294
keccakf1600_simple_bmi2_stddev 7 ns 7 ns 1181294
keccakf1600_simple_avx2_mean 586 ns 586 ns 1078937
keccakf1600_simple_avx2_median 581 ns 581 ns 1078937
keccakf1600_simple_avx2_stddev 13 ns 13 ns 1078937
keccakf1600_simple_haswell_mean 584 ns 584 ns 1167725
keccakf1600_simple_haswell_median 580 ns 580 ns 1167725
keccakf1600_simple_haswell_stddev 7 ns 7 ns 1167725
keccakf1600_new_mean 797 ns 798 ns 873087
keccakf1600_new_median 792 ns 792 ns 873087
keccakf1600_new_stddev 18 ns 18 ns 873087
keccakf1600_openssl_mean 677 ns 677 ns 1020223
keccakf1600_openssl_median 675 ns 675 ns 1020223
keccakf1600_openssl_stddev 4 ns 4 ns 1020223
keccakf1600_avx2_mean 560 ns 560 ns 1251563
keccakf1600_avx2_median 559 ns 559 ns 1251563
keccakf1600_avx2_stddev 3 ns 3 ns 1251563
keccakf1600_asm_mean 677 ns 677 ns 1035345
keccakf1600_asm_median 675 ns 675 ns 1035345
keccakf1600_asm_stddev 8 ns 9 ns 1035345
```
</p>
</details>
## Benchmarking conclusions
1. We should continue with C/C++ implementation avoiding usage of assembly.
2. We should enable [BMI and BMI2](https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set) instruction set extensions on x86_64 architecture (see [BMI deployment](#BMI-deployment) for possible ways of doing so).
a. The `BMI` allows a compiler to use `andn` instruction which is `andn(a, b): ~a & b`.
b. The `BMI2` allows a compiler to use `rorx` instruction in place of regular `ror` instruction.
## BMI deployment
For the best performance we want to enable [BMI and BMI2](https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set) extensions (later referred as just BMI) for x86_64 architecture.
1. Build with BMI by default.
a. Add CMake option `KECCAK_BMI` `ON` by default.
b. This can be promoted to `ETHASH_BMI` for whole package.
c. This will cause "invalid instruction" crash on hardware without BMI.
2. Build with BMI if supported by build hardware.
a. Same as 1, but initial `KECCAK_BMI` depends on BMI support check during build configuration.
b. Bad for distributing binaries because depends on CI hardware.
c. Good for users building from source because they will always get working version.
3. Runtime abort check.
a. Inherits from 1 or 2.
b. In the BMI-enabled build add a startup check if BMI is supported. If not abort the process with a message with more details.
4. Dynamic implementation selection.
a. Select the best implementation at runtime.
b. This can be done by a mechanism similar to [Function Multiversioning](https://gcc.gnu.org/wiki/FunctionMultiVersioning).
c. The implementation is resolved once (at first call?), then every call is indirect call (may affect performance in static libs but should be indifferent in shared libraries).
d. See example: https://godbolt.org/z/d5Wvdd.
## TODO
1. The implementation is effectively OpenSSL's KECCAK_2X without TRANSFORMATION.
2. But it uses different syntax which allows compiler to produce tiny better code.
3. Although the OpenSSL's implementation is more compact (and probably more readable) I decided to stick to the current syntax.
4. The source of original implementation: https://gist.github.com/chfast/b4ec6218ef486222edaf5e71cc0f209a