# Benchmarking EVM instructions

## Introducing synthetic EVM benchmarks

The evmone project has recently gained "synthetic" EVM benchmarks ([PR#278](https://github.com/ethereum/evmone/pull/278)). This is a set of on-demand generated EVM bytecodes available in the `evmone-bench` tool. Each bytecode tries to stress a single EVM instruction. The initial set of instructions has been selected by the following criteria:

- gas cost is constant (does not depend on instruction inputs),
- the instruction implementation is constant-time (or at least straightforward and simple) — this eliminates instructions like `DIV`, `MOD` or `ADDMOD`,
- the instruction does not access memory directly — in our opinion EVM memory benchmarking requires separate research and design,
- the instruction does not access Ethereum state (e.g. `SLOAD`, `EXTCODEHASH`) or any other information outside of the EVM call frame (e.g. `DIFFICULTY`) — the performance of these instructions is dominated by blockchain data access and requires different benchmarking infrastructure.

TODO: Add appendix with all included instructions grouped by kind.

The selected instructions are further grouped by their kind:

- nop,
- unary operators,
- binary operators,
- nullary operators,
- pushes,
- swaps,
- dups.

The kind of the instruction determines the EVM stack balance requirements and therefore the benchmark bytecode structure.

## Benchmark bytecode structure

### Loop

In the first prototype I tried using straight-line EVM code. Such an approach turned out to be problematic (no surprise really) because the timings were greatly affected by the `JUMPDEST` analysis. This is especially visible for evmone Advanced, which trades intensive code analysis for faster execution later. But results are skewed also in evmone Baseline, which performs only the absolutely minimal `JUMPDEST` analysis. It must be noted that the time of `JUMPDEST` analysis is proportional to code size. And the same property holds for straight-line code execution.
One option to separate the two is to use evmone's API directly, where the execution time can be measured after the analysis has completed. The disadvantage of such a solution is the loss of generality and of possible interoperability with other tools and EVM implementations for which bytecode may be delivered via a command-line interface. Therefore I decided to solve the problem differently. The essential straight-line benchmark code has been wrapped with a loop with a fixed number of iterations. The loop can be represented by the following pseudo-code.

```javascript=
var counter = -255
do {
    // benchmark inner code
} while ((counter += 1) != 0)
```

The exact EVM implementation is as follows.

```evm=
PUSH32 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff01
JUMPDEST
// benchmark inner code
PUSH1 1
ADD
DUP1
PUSH1 33
JUMPI
```

This version of the loop has the minimum number of instructions and does not use EVM memory. It leaves the loop counter as the only EVM stack item. Generated inner codes must take the loop counter into account — they are allowed to duplicate and use it but must not modify nor drop it.

The number of loop iterations can be used to control the ratio of analysis time to execution time. With the presented configuration the loop is executed 255 times. This keeps the analysis time at ~1% of total execution time for evmone Advanced. The analysis in evmone Baseline is at least 4 times faster. The execution time of the loop itself is at most 2% of total execution time. This can be confirmed by paying attention to the `execute/synth/loop` benchmarks executing the loop with empty inner code.

### Instruction categories

To support the bytecode generator, instructions are categorized by their stack requirements, i.e. how deep they access the EVM stack for inputs and how the stack height is changed by instruction execution. There are also other properties taken into account for category assignments.

1. No-op _n_ — does not have any inputs nor outputs.
In this category we only have `JUMPDEST`.
2. Nullary operator — has no inputs and one result, the stack height is increased by 1. E.g. `MSIZE`, `CALLER`.
3. Unary operator — has one input and one result, the stack height remains unchanged. E.g. `ISZERO`, `NOT`.
4. Binary operator — has two inputs and one result, the stack height is decreased by 1. E.g. `MUL`, `EQ`.
5. Push — all `PUSH` instructions. They push a value to the stack and therefore work as nullary operators, but a separate category is useful for generating the immediate values that follow any `PUSH` opcode.
6. Dup — all `DUP` instructions. They push a value to the stack and have variadic stack height requirements which can be deduced easily from the opcode number.
7. Swap — all `SWAP` instructions. They do not change the stack height and have variadic stack height requirements which can be deduced easily from the opcode number.

### Inner code structure

The structure of the generated benchmark loop inner code is defined by the instruction category. Moreover, we define one more property called "stack mode":

- _0_ "minimal stack" — the generated code increases the EVM stack height minimally,
- _1_ "full stack" — the generated code utilizes the EVM stack up to its height limit (1024).

For all defined instruction categories and modes we have 11 different generators (unary, binary and swap instructions do not go with the "full stack" mode).

#### Example 1: Benchmark of `PUSH1` in "full stack" mode (_p1_)

`PUSH1` is an instruction taking no arguments and producing a result pushed to the EVM stack. Therefore, the newly produced values must be balanced with a matching number of `POP` instructions. To utilize the full EVM stack height but also account for the loop counter already on the stack, we define the "stack limit" constant $1023$. The benchmark loop inner code first pushes 1023 new values to the EVM stack, which are later dropped by 1023 `POP` instructions.
For push instructions we always generate zeroed "push data" of the required length.

```
1023 * (PUSH1 + "00") + 1023 * POP
```

#### Example 2: Benchmark of `CALLER` in "min stack" mode (_n0_)

Similarly to the `PUSH1` instruction, the `CALLER` instructions must be balanced by a matching number of `POP` instructions. However, in the "min stack" mode these are interleaved, i.e. each `CALLER` instruction is followed by a `POP` instruction. Although an arbitrary number of `CALLER`+`POP` pairs could be used, the constant 1023 is selected as in the "full stack" mode. This gives us the property that the "min stack" and "full stack" modes of the same main instruction differ only by the instruction execution order.

```
1023 * (CALLER + POP)
```

## EVM implementation overview

For this work the evmone Baseline EVM implementation was used. Baseline is a "classic" and "naive" EVM implementation in C++:

- it performs only the minimal required `JUMPDEST` analysis, and this analysis overhead is mitigated by the benchmark design (see "Loop"),
- it uses a basic `switch` statement for the interpreter dispatch,
- it contains a code section common to all instructions which checks pre-execution instruction requirements: stack overflow/underflow and gas cost,
- the specialized [intx](https://github.com/chfast/intx) library is used to implement 256-bit arithmetic operations — decent performance is expected in this area.

## Collecting data

In this section we present how to collect data from benchmark runs. This process is reproducible but not automated yet and requires performing a number of manual steps. The process will be automated later, provided the presented approach to the analysis of the performance of low-level EVM instructions is well received.

TODO: Point to evmone git tag.

For this piece of work we only collected data for 5 selected instructions: `JUMPDEST`, `NOT`, `MUL`, `PUSH32`, `DUP1`. (TODO: explain some of the selections).
Data is available in [this spreadsheet](https://docs.google.com/spreadsheets/d/1OyvFQZON3AfwGLPgHutOEi03DSBI6rR5h8QS1YNqEpQ/edit?usp=sharing).

The essential results come from the `evmone-bench` tool compiled with [Clang](https://clang.llvm.org) 11 on Linux. Internally, `evmone-bench` uses the [Benchmark](https://github.com/google/benchmark) library as the benchmark driver and CLI interface. For the results presented in this article the number of driver **iterations** per benchmark case has been fixed to **1000**. By default the Benchmark library auto-selects the number of iterations based on the benchmark case execution time. This however increases the tool startup time and causes increased interference with the `perf` tool.

In total four different CPU configurations were used: two CPUs, each with turbo mode enabled/disabled:

- Haswell architecture: [Intel Core i7-4790K](https://ark.intel.com/content/www/us/en/ark/products/80807/intel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html)
  - turbo clock frequency: 4.4 GHz
  - base clock frequency: 4.0 GHz
- Skylake architecture: [Intel Core i7-8565U](https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html)
  - turbo clock frequency: 4.6 GHz
  - base clock frequency: 1.8 GHz

### Instructions

1. Use the `evm_benchmarking` branch. TODO: Add git tag.
2. Build the evmone project with the `-DCMAKE_BUILD_TYPE=Release -DEVMONE_TESTING=ON` CMake options.
3. Synthetic benchmark bytecodes may be dumped to the current working directory for later examination.
   ```bash
   > BENCH_DUMP=1 bin/evmone-bench
   ```
4. Inspect the instruction execution "histogram": we are interested in the "main" and "all" instruction counts. In the following example there are 523261 instructions executed in total, out of which 260865 are `ADD` instructions.
   ```
   > evmc/bin/evmc run --vm ./lib/libevmone.so,O=0,histogram --gas 100000000 $(xxd -p -c100000 < ADD_b0)
   Executing on Istanbul with 100000000 gas limit in ./lib/libevmone.so,O=0,histogram
   Histogram:
   ADD,260865
   POP,255
   JUMPI,255
   JUMPDEST,255
   PUSH1,510
   PUSH32,1
   DUP1,261120
   all,523261

   Result:   success
   Gas used: 1570803
   Output:
   ```
5. Use the `perf` tool to collect CPU stats, preferably with the benchmark iterations fixed.
   ```
   > perf stat --no-big-num -e 'task-clock,cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses' bin/evmone-bench --benchmark_filter=baseline/execute/synth/ADD/b0
   2021-02-11 13:51:54
   Running bin/evmone-bench
   Run on (8 X 4400 MHz CPU s)
   CPU Caches:
     L1 Data 32K (x4)
     L1 Instruction 32K (x4)
     L2 Unified 256K (x4)
     L3 Unified 8192K (x1)
   -----------------------------------------------------------------------------------------------------
   Benchmark                                           Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------------------------------
   baseline/execute/synth/ADD/b0/iterations:1000    1972 us         1972 us         1000 gas_rate=796.723M/s gas_used=1.5708M

    Performance counter stats for 'bin/evmone-bench --benchmark_filter=baseline/execute/synth/ADD/b0':

           1976,20 msec task-clock                #    0,991 CPUs utilized
        7896034312      cycles                    #    3,996 GHz                      (33,40%)
       24368892010      instructions              #    3,09  insn per cycle           (50,20%)
        4200562887      branches                  # 2125,580 M/sec                    (66,80%)
           1121673      branch-misses             #    0,03% of all branches          (83,40%)
        8915888294      L1-dcache-loads           # 4511,642 M/sec                    (82,79%)
            761855      L1-dcache-load-misses     #    0,01% of all L1-dcache hits    (33,20%)

       1,994478506 seconds time elapsed

       1,976448000 seconds user
       0,000000000 seconds sys
   ```

TODO: Add appendix with raw benchmark results for Haswell 4.4.

## Estimating gas cost

The results from the synthetic benchmarks can be analyzed to estimate the boundaries of instruction execution costs.
"Properties"

- Times converted to CPU cycles.
- Scaled to 2.3 GHz CPU speed (scaling vector).
- Gas target: 100 Mgas/s, i.e. 1 gas per 10 ns.

### Notation

- $f$ — CPU frequency
- $T$ — CPU time
- $C$ — CPU cycles: $C \approx fT$
- $k$ — benchmark iterations

### Counting cycles

For a single benchmark case we receive 3 different "time" measurements. The `evmone-bench` tool reports the bytecode execution CPU time $T_{TOTAL}$. The tool actually executes the bytecode in a loop for a fixed number of iterations $k$. The `evmone-bench` process is also measured with `perf`, which gives the CPU time $T_{perf}$ (_task-clock_ event) and CPU cycles $C_{perf}$ (_cycles_ event). Some additional overhead related to the process startup and tool initialization $T_{startup}$ is included in $T_{perf}$. However, we deliberately keep this overhead below 2% by running the tool benchmark loop long enough — 1000 iterations is enough for the fastest `JUMPDEST` benchmark.

(TODO: This equation is probably not useful any more)

$$T_{perf} = T_{startup} + kT_{TOTAL}$$

The overhead being small could be ignored, or made even smaller by increasing the number of iterations $k$. Yet we decided to spend a bit of time on how to compensate for the overhead $T_{startup}$. We are interested in estimating $C_{TOTAL}$ having measured $C_{perf}$, $T_{perf}$ and $T_{TOTAL}$. The best CPU frequency estimation $f \approx \frac{C_{perf}}{T_{perf}}$ is derived from the perf stats of the executed process. Finally

$$C_{TOTAL} \approx fT_{TOTAL} \approx \frac{C_{perf}}{T_{perf}}T_{TOTAL}$$

### Lower boundary of the interpreter loop cost

A good first step is to estimate the EVM interpreter loop cost. This information is going to be used in further cost analysis. Ideally, we would execute the interpreter with a lot of `NOP` instructions. Unfortunately, the EVM does not have a `NOP` instruction. However, in some implementations (including evmone Baseline, but not evmone Advanced) `JUMPDEST` is a good substitute.
This instruction usually only affects the `JUMPDEST` analysis performed before execution and has nothing to do during execution. For Baseline this works great because:

- `JUMPDEST` has an empty implementation,
- yet `JUMPDEST` instructions are not removed from the internal program representation (this would require a bytecode copy and modification, which Baseline deliberately avoids),
- the interpreter loop contains common code checking instruction gas and stack requirements (although the implementation may alter this property in the future) — execution times are not going to be close to zero.

The `synth/JUMPDEST/n0` benchmark gives the lower boundary of the runtime cost of any EVM instruction because:

- it is assumed any instruction will execute longer than `JUMPDEST`,
- the benchmark bytecode contains mostly long sequences of `JUMPDEST` instructions, which is CPU branch predictor friendly,
- only up to two EVM stack items are ever used (by the benchmark loop),
- it is assumed that the benchmark loop instructions have the same cost as `JUMPDEST`, i.e. we can count them as another `JUMPDEST` instantiation.

The `synth/JUMPDEST/n0` benchmark has 2047 `JUMPDEST` instructions inside the loop (including the leading loop jump destination) and 5 loop iteration instructions: `ADD`, `JUMPI`, 2x `PUSH1` and `DUP1`. There is also a single `PUSH32` instruction at the very beginning of the bytecode with the initial loop counter value. All 255 loop iterations give 523261 instructions.

```
> evmc/bin/evmc run --vm ./lib/libevmone.so,O=0,histogram $(xxd -p -c100000 < JUMPDEST_n0)
Executing on Istanbul with 1000000 gas limit in ./lib/libevmone.so,O=0,histogram
Histogram:
ADD,255
JUMPI,255
JUMPDEST,521985
PUSH1,510
PUSH32,1
DUP1,255
all,523261

Result:   success
Gas used: 527598
```

The benchmark has been measured to execute in 4397246 CPU cycles. This gives the _lower boundary_ of the `JUMPDEST` instruction cost:

- 8 CPU cycles,
- 3.65 ns execution time,
- 0.37 gas cost.
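The three numbers above can be cross-checked with a quick calculation (a sketch only; the 2.3 GHz reference frequency and the 100 Mgas/s gas target are the ones stated in "Estimating gas cost"):

```python
# Cross-check the JUMPDEST lower-boundary figures quoted above.
total_cycles = 4397246   # measured cycles for one bytecode execution
instructions = 523261    # instructions executed (histogram "all" count)
ref_freq_ghz = 2.3       # reference CPU speed from "Estimating gas cost"
gas_rate = 100e6         # gas/s target, i.e. 1 gas per 10 ns

cycles_per_instr = total_cycles / instructions     # cycles per instruction
time_ns = cycles_per_instr / ref_freq_ghz          # time at 2.3 GHz, in ns
gas_cost = time_ns * gas_rate / 1e9                # gas at 100 Mgas/s

print(f"{cycles_per_instr:.1f} cycles, {time_ns:.2f} ns, {gas_cost:.2f} gas")
# prints: 8.4 cycles, 3.65 ns, 0.37 gas
```

This matches the bullet list, with the cycle count rounded down to 8 in the text.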
$$T_{TOTAL} = 521985T_{JUMPDEST} + 255T_{ADD} + 255T_{JUMPI} + 510T_{PUSH1} + 255T_{DUP1} + T_{PUSH32}$$

$$\forall op \in \{ADD, JUMPI, PUSH1, DUP1, PUSH32\}: T_{op} \geq T_{JUMPDEST}$$

$$T_{TOTAL} \geq 523261T_{JUMPDEST}$$

$$T_{JUMPDEST} \leq \frac{T_{TOTAL}}{523261}$$

The `JUMPDEST` estimation upper bound:

$$\overline{T}_{JUMPDEST} = \frac{T_{JUMPDEST/n0}}{523261}$$

To get the lower bound assume (5 here is a guess based on observations)

$$\forall op \in Ops: T_{op} \leq 5T_{JUMPDEST}$$

which gives

$$T_{JUMPDEST} \geq \frac{T_{JUMPDEST/n0}}{528365} = \underline{T}_{JUMPDEST}$$

From this we have very high confidence (~1% error) in the true $T_{JUMPDEST}$ value. Furthermore, the range can be made arbitrarily small by increasing the number of `JUMPDEST` instructions inside the loop.

$$\frac{\underline{T}_{JUMPDEST}}{\overline{T}_{JUMPDEST}} = \frac{523261}{528365} \approx 0.99$$

### Estimating cost of unary operators

This is done using the _.../u0_ kinds of benchmarks, for which a loop iteration consists of 2046 unary operator instructions and 6 instructions related to the loop itself. This gives

$$T_{unop/u0} \geq 521730T_{unop} + 1531T_{JUMPDEST} \geq 521730T_{unop} + 1531\underline{T}_{JUMPDEST}$$

$$T_{unop} \leq \overline{T}_{unop} = \frac{T_{unop/u0} - 1531\underline{T}_{JUMPDEST}}{521730}$$

### Estimating cost of binary operators

Binary operators consume more (2 values) than they produce (1 value), so the benchmarks are balanced with additional producers in the form of `DUP1` instructions. Here we can use the _b0_ and _b1_ benchmark kinds. Both kinds execute exactly the same instructions, just in a different order. Execution consists of 523261 instructions, including 260610 binary operators.
To get the upper estimation of the binary operator timing $\overline{T}_{binop}$, the timing of the 262651 instructions other than the binary operator can be replaced with $\underline{T}_{JUMPDEST}$:

$$T_{binop/bN} \geq 260610T_{binop} + 262651T_{JUMPDEST} \geq 260610T_{binop} + 262651\underline{T}_{JUMPDEST}$$

$$T_{binop} \leq \overline{T}_{binop} = \frac{T_{binop/bN} - 262651\underline{T}_{JUMPDEST}}{260610}$$

where $N \in \{0,1\}$.

### Estimating cost of PUSH/DUP

TODO: This is very similar to binop, just balanced with `POP` instructions. A comment alone about it should be enough.

## Results

https://docs.google.com/spreadsheets/d/1OyvFQZON3AfwGLPgHutOEi03DSBI6rR5h8QS1YNqEpQ/edit?usp=sharing

TODO: Add table with cycles for Haswell and Skylake.
TODO: Add table with timings for ref CPU.
TODO: Add table with proposed gas costs, rounded to integers.
TODO: Why only a subset of instructions was benchmarked?

### Comparing stack usage cases

TODO: Section draft

In this section we should explain why the benchmark pairs:

- MUL/b0 vs MUL/b1
- PUSH32/p0 vs PUSH32/p1 (only Haswell)
- DUP1/d0 vs DUP1/d1 (only Haswell)

indicate different results although they execute exactly the same number of the same instructions, just in a different order. There are two classic explanations for this:

1. Different instruction orders affect the branch predictor of the interpreter dispatch differently. This argument is dismissed in the next section: branch mispredictions are extremely low in all benchmarks.

Collect perf data with

```shell=
perf stat -e 'cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,br_inst_exec.all_indirect_jump_non_call_ret,br_misp_exec.all_indirect_jump_non_call_ret' bin/evmone-bench ../../test/benchmarks --benchmark_filter=baseline/execute/synth/PUSH32/p1
```

The perf data includes the whole evmone-bench run, but the `baseline_execute()` function takes at least 98% of the time.

### Interpreter dispatch prediction

1.
Baseline uses this instruction to dispatch:
   ```
   ff e1    jmpq *%rcx
   ```
2. Decoding it with the help of https://www.felixcloutier.com/x86/jmp, it is
   ```
   FF /4    JMP r/m64    Jump near, absolute indirect
   ```
3. On Haswell we used the following CPU events to catch it:
   - `br_inst_exec.all_indirect_jump_non_call_ret`
   - `br_misp_exec.all_indirect_jump_non_call_ret`
4. CPU events manuals: https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/intel-processor-events-reference.html
5. These events are not available on Skylake.
6. These count not only "retired" but also "speculated" instructions.
7. However,
   - the `br_inst_...` count matches the number of EVM instructions executed,
   - the `br_misp_...` count matches 1x or 2x the benchmark loop count, depending on the "mode".
8. Therefore the misprediction rate is only 0.1% or 0.05%.
9. TODO: Measurements should be repeated with fixed-iterations benchmarks.
10. TODO: Add table with misprediction rates (both all and br_inst) of all benchmarks.

Conclusion: interpreter instruction misprediction is negligible.

## Appendix R: Reproduce results

1. Use the `evm_benchmarking` branch. TODO: Add git tag.
2. Dump the synthetic benchmark bytecodes.
   ```shell
   > BENCH_DUMP=1 evmone-bench
   ```
3. Produce the instruction histogram: we want the "main" and "all" instruction counts.
   ```
   > evmc/bin/evmc run --vm ./lib/libevmone.so,O=0,histogram --gas 100000000 $(xxd -p -c100000 < ADD_b0)
   Executing on Istanbul with 100000000 gas limit in ./lib/libevmone.so,O=0,histogram
   Histogram:
   ADD,260865
   POP,255
   JUMPI,255
   JUMPDEST,255
   PUSH1,510
   PUSH32,1
   DUP1,261120
   all,523261

   Result:   success
   Gas used: 1570803
   Output:
   ```
4. Collect CPU stats, preferably with benchmark iterations fixed (TODO: provide event list).
   ```
   perf stat -ddd --no-big-num -- bin/evmone-bench ../../test/benchmarks --benchmark_filter='baseline/execute/synth/ADD/b0'
   ```

## Appendix I

TODO: Table of all included instructions grouped by kind.

## Appendix B

TODO: JSON / text file with benchmark results.
They will not match spreadsheet results because for these we needed individual perf runs.
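The spreadsheet's cycle figures follow the "Counting cycles" estimation $C_{TOTAL} \approx \frac{C_{perf}}{T_{perf}} T_{TOTAL}$. A minimal sketch of that per-run post-processing (the numeric inputs below are illustrative, taken roughly from the ADD/b0 perf example earlier, not exact spreadsheet values):

```python
# Sketch of the "Counting cycles" post-processing:
#   f ≈ C_perf / T_perf  (estimated CPU frequency)
#   C_TOTAL ≈ f * T_TOTAL  (cycles attributed to one bytecode execution)
def estimate_cycles(c_perf, t_perf_s, t_total_s):
    """Estimate bytecode-execution cycles from whole-process perf stats."""
    f = c_perf / t_perf_s   # CPU frequency in Hz, from perf counters
    return f * t_total_s    # cycles for one benchmark-case execution

# Illustrative inputs: 7896034312 cycles over 1.97620 s of task-clock
# (~4.0 GHz), with evmone-bench reporting 1972 us CPU time per iteration.
cycles = estimate_cycles(7896034312, 1.97620, 1972e-6)
print(f"~{cycles / 1e6:.1f}M cycles per execution")
# prints: ~7.9M cycles per execution
```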