# EVM384 Security Analysis: Potential Gas DoS Attacks

This is a living document to analyze potential runtime amplification attacks on EVM384 opcode gas costs. It remains an open question whether attacks are possible.

Recall that EVM384 proposes three new opcodes, `ADDMOD384`, `SUBMOD384`, and `MULMODMONT384`. Each EVM384 opcode pops one stack item from which it reads four packed memory offsets, each 32 bits. The offsets are to (i) two input 48-byte values, (ii) one input 48-byte modulus (with, for `MULMODMONT384`, an 8-byte parameter concatenated), and (iii) one 48-byte output. For more information, the most comprehensive list of EVM384 links is in [this magicians thread](https://ethereum-magicians.org/t/evm384-feedback-and-discussion/4533).

Comparing EVM384 to EVM, the closest opcodes in terms of memory use are `MLOAD`, `MSTORE`, `SHA3`, and `*COPY`. But these each access one memory offset, and possibly a long sequential array. There are no opcodes which access four memory offsets like the EVM384 opcodes do.

Perhaps an attack can exploit this many memory accesses. If all accesses are cache misses, then execution may be slow. Moreover, the bytecode itself can be a cache miss, resulting in a slowdown to read the next opcode. This can be further amplified if each cache miss is also a TLB miss, which results in an expensive page walk to resolve the virtual -> physical address mapping.

In this security analysis, CPUs of interest have an L1 cache of at least 64KB, possibly more cache layers up to a final cache layer of at least 4MB, and a memory of at least 1GB. We assume that cache lines are 64 bytes, which is common.

We will craft EVM bytecode which forces cache misses. Some notes to consider when crafting our bytecode:

- L1 cache accesses are the fastest, and there is a slowdown to access each higher layer of cache and memory. A nice visualization of this slowdown is [here](https://uops.info/cache/lat_combined.html) (see the jumps at each cache layer).
- Cache lines, 64 bytes for us, are the atomic unit for an object in cache, i.e. the unit that is brought into or evicted from cache.
- To maximize cache pressure, each access may span two cache lines. Also, reading integers across cache line boundaries is slower on some processors.
- In 24kb of bytecode, we can fit 1365 unique invocations of `ADDMOD384`, each taking 18 bytes of bytecode: a PUSH16 with its immediate plus the EVM384 opcode (see the sketch after Algorithm 1 below). It is possible to fit more, but not much more at the same cost, and we focus on simplicity for now.
- It costs 5460 gas for 1365 `ADDMOD384` invocations (2 gas for `PUSH16` and 2 gas for `ADDMOD384`).
- The gas cost of memory expansion grows quadratically; see the cost formula `Cmem()` in the yellow paper appendix H.1, and the table below.

## Attack: Cache Level Thrashing

Our first attack creates cache misses in L1 cache, and possibly other layers of cache. Consider a single contract with a loop containing as many `ADDMOD384` as possible, each accessing memory offsets that are unique with respect to the rest of the loop body. More specifically, each access spans two cache lines, as mentioned above. We use `ADDMOD384` because `SUBMOD384` would give similar results. We also try `MULMODMONT384` instead of `ADDMOD384`.

___

```
ALGORITHM LoopOverADDMOD384:
  for 2250 iterations:
    PUSH16 0x...
    ADDMOD384
    PUSH16 0x...
    ADDMOD384
    .
    . (1365 times)
    .
    PUSH16 0x...
    ADDMOD384
```

**Algorithm 1:** A loop over as many `ADDMOD384` as possible within code size limits and gas limits. Each PUSH16 has an immediate containing the packed offsets, and each offset is unique with respect to all other offsets in the loop body.

___
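For concreteness, here is a minimal sketch of how Algorithm 1's loop body can be assembled, and how the 18 bytes and 4 gas per invocation add up to the figures in the notes above. The `ADDMOD384` opcode byte, the field order of the packed offsets, and the 40-byte misalignment used to straddle cache lines are illustrative assumptions, not part of the proposal:

```python
# Sketch only: assemble Algorithm 1's loop body and count bytes/gas.
import struct

PUSH16 = 0x6f        # existing EVM opcode: push a 16-byte immediate
ADDMOD384 = 0xc3     # ASSUMPTION: placeholder opcode byte, not finalized
MISALIGN = 40        # ASSUMPTION: start each 48-byte value 40 bytes into its
                     # 128-byte region so it straddles a cache line boundary

def packed_offsets(out, x, y, mod):
    """Pack four 32-bit offsets into the 16-byte PUSH16 immediate.
    ASSUMPTION: the field order (out, x, y, mod) is for illustration only."""
    return struct.pack(">IIII", out, x, y, mod)

def loop_body(invocations=1365, region=128):
    """One PUSH16 + ADDMOD384 pair per invocation, every offset unique and
    placed in its own 128-byte region (two cache lines)."""
    code = bytearray()
    for i in range(invocations):
        base = 4 * i * region + MISALIGN
        imm = packed_offsets(base, base + region, base + 2 * region, base + 3 * region)
        code += bytes([PUSH16]) + imm + bytes([ADDMOD384])
    return bytes(code)

body = loop_body()
print(len(body), "bytes")            # 1365 * 18 = 24570 bytes, just under 24kb
print(1365 * (2 + 2), "gas")         # 5460 gas per pass, using the costs above
print(1365 * 4 * 128, "bytes of memory touched")  # 698880, as used below
```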
We observe the following parameters and constraints.

- For `ADDMOD384`, we can do around 2250 loop iterations with a 12,500,000 gas limit. For `MULMODMONT384`, we can do around 1300 loop iterations. These numbers depend on gas costs and gas limits, but changing them should not change the conclusions.
- We use 698880 bytes of memory (= 1365 * 4 * 128, since there are 1365 unique invocations, each accessing four offsets, each offset to a 128-byte region which spans two cache lines). This memory expansion costs 997220 gas, a one-time cost amortized over all loop iterations. Because this cost is fixed and not dominant, we ignore it in our analysis.

Our runtimes for this attempted attack follow.

---

| | fixed_offsets runtime | random_offsets runtime | slowdown |
| --- | --- | --- | --- |
| `ADDMOD384` | 65 ms | 118 ms | 1.82x |
| `MULMODMONT384` | 88 ms | 145 ms | 1.65x |

**Table 1.** Runtimes of Algorithm 1 with `ADDMOD384` and `MULMODMONT384`, with cases for fixed offsets and random offsets. Code to reproduce these numbers is [here](https://gist.github.com/poemm/8213eebb26e546cb9353fe1fda539e1b), measured on an i5-5300U.

---

This attack gives a 1.82x slowdown. Perhaps with some tuning, this attack may be made worse. To mitigate this attack, we consider reducing the memory offsets from 32 bits to 16 bits each, which would decrease addressable memory to 64KB and make this attack less effective.

## Attack: Cache Thrashing

This attack seeks to evict the entire cache to memory, making each memory access a cache miss. Our experimental bad case, in which each memory access is both a TLB miss and a cache miss, is 100 ns per access, measured on an i5-5300U by running https://github.com/torvalds/test-tlb.git with `make run` and taking the last output number.

Below is a table of costs to expand memory. Notice that the cost grows quadratically in the size of memory; a sketch of the underlying formula follows Algorithm 2 below.

```
memory size    memory expansion gas cost    number of unique ADDMOD384 accesses    total gas for ADDMOD384s
1kb            101                          6                                      18
2kb            203                          12                                     36
4kb            419                          25                                     75
8kb            900                          51                                     153
16kb           2053                         102                                    306
32kb           5127                         204                                    612
64kb           14347                        409                                    1227
128kb          45075                        819                                    2457
256kb          155683                       1638                                   4914
512kb          573507                       3276                                   9828
1MB            2195587                      6553                                   19659
2MB            8585475                      13107                                  39321
4MB            33948163                     26214                                  78642
```

Notice:

- Memory expansion cost dominates `ADDMOD384` cost, so the cost per cache miss is expensive. We must amortize this memory expansion cost.
- It is not possible to grow memory past 4MB within current block gas limits, so we need another way to apply 4MB of cache pressure.
- Evicting cache by expanding memory is very expensive. So we may consider evicting memory by growing the stack instead, which costs two gas per 32-byte stack item using opcodes like `ADDRESS`.

We sketch the following attack algorithm which first expands EVM memory once, then loops over evicting this EVM memory from cache and accessing the evicted memory with all cache misses.

___

```
ALGORITHM EvictMemoryThenADDMOD384CacheMisses:
  Grow memory
  loop:
    CALL MemoryEvictor
    (at this point, our memory is evicted)
    PUSH16 0x...
    ADDMOD384
    PUSH16 0x...
    ADDMOD384
    .
    . (1365 times)
    .
    PUSH16 0x...
    ADDMOD384

ALGORITHM MemoryEvictor(n):
  ADDRESS
  ADDRESS
  .
  . (1024 times)
  .
  ADDRESS
  if n < 128: MemoryEvictor(n+1)
```

**Algorithm 2:** The first algorithm calls `ADDMOD384` many times over memory locations evicted from cache, all cache misses. The second algorithm evicts memory by growing the stack with `ADDRESS`; each `ADDRESS` costs two gas and touches a 32-byte stack item, touching a total of 32KB per call. This is done recursively 128 times until we touch 4MB, or all of cache.

___
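Before working through the gas accounting for Algorithm 2, it helps to have at hand the memory expansion formula behind the table above. Here is a minimal sketch, assuming the yellow paper appendix H.1 definition `Cmem(a) = Gmemory * a + floor(a^2 / 512)` with `Gmemory = 3` and `a` the memory size in 32-byte words; the printed values track the table, with small differences because the attack's accesses spill slightly past each nominal memory size:

```python
# Sketch of the yellow paper memory cost Cmem() behind the table above.
G_MEMORY = 3

def c_mem(size_in_bytes):
    """Total gas charged for memory of the given size; the expansion cost
    from size s1 to s2 is c_mem(s2) - c_mem(s1). The quadratic term is what
    makes growing memory to cache-sized regions prohibitively expensive."""
    a = (size_in_bytes + 31) // 32          # size in 32-byte words
    return G_MEMORY * a + a * a // 512

for label, size in [("64kb", 64 * 1024), ("1MB", 1 << 20), ("4MB", 4 << 20)]:
    print(label, c_mem(size), "gas")
# 64kb 14336 gas, 1MB 2195456 gas, 4MB 33947648 gas -- close to the table rows
```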
Some notes:

- 32KB of stack items is touched 128 times, for a total of 4MB. (Actually, we may need fewer than 128 passes since there is other pressure on cache. But we use 128 to be sure.)
- Once the cache eviction is done, we can do 1365 invocations of `ADDMOD384` with each memory access a cache miss.
- The cost of each memory evictor is 2748 gas (700 gas for the call, plus two gas for each `ADDRESS`). The cost of 128 memory evictors is 351744 (= 128 * 2748) gas.
- The effective cost of each `ADDMOD384` (including the cost to evict its memory locations) is 262 (= (5460 + 351744) / 1365) gas. But our worst case for the cache misses is four times 100 ns, i.e. 400 ns, which corresponds to about 40 gas.

This attack is ineffective because the cost to evict cache dominates.

## Conclusion

From our experiments, we observe that modern hardware is designed to hide memory latency by allowing concurrent memory access operations whenever possible (see https://github.com/lemire/testingmlp.git with `make` then `./testingmlp` to measure the concurrency speedup). This, along with interpreter overhead reducing contention for cache and memory hardware, greatly reduces the potential of an attack.

One takeaway from this analysis is that packing 32-bit memory offsets allows accessing arbitrary memory locations. If we limit offsets to 16 bits each, then we are limited to 64KB of addressable memory, so any attack on such a small memory would be quickly exhausted.

It remains an open question whether other attacks are possible. We solicit independent security analysis.
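To make the takeaway about offset width concrete, here is a minimal sketch contrasting the current 32-bit packing with a hypothetical 16-bit packing (the 16-bit variant is not part of the proposal):

```python
# Sketch: addressable memory under 32-bit vs hypothetical 16-bit packed offsets.
import struct

def pack_offsets_32(out, x, y, mod):
    return struct.pack(">IIII", out, x, y, mod)   # 16-byte immediate, offsets up to ~4GB

def pack_offsets_16(out, x, y, mod):
    return struct.pack(">HHHH", out, x, y, mod)   # 8-byte immediate, offsets capped at 64KB

print(len(pack_offsets_32(0, 0, 0, 0)), "bytes; max offset", 2**32 - 1)
print(len(pack_offsets_16(0, 0, 0, 0)), "bytes; max offset", 2**16 - 1)
# With 16-bit offsets, every operand lies within a 64KB region, which fits in
# cache on the CPUs considered here, so cache-miss amplification is quickly exhausted.
```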