# Multi-Target Verkle-trie Branch Attacks
The Ethereum protocol plans to [upgrade](https://verkle.info/#332fc5e880bc452d96477abac0a2f164) from a Merkle tree to a Verkle trie. With the new trie structure, one class of attacks that needs to be evaluated is trie grinding attacks, where an adversary who wants to carry out a denial-of-service attack writes data to the Verkle trie, choosing Ethereum address and storage slot in such a way that the depth of the trie grows beyond what we would expect for random writes. Accessing data that is stored at a high depth in this trie is more costly in various ways.
This may cause gas mispricing, and honest users' data that gets pushed to a high depth becomes more expensive to access.
See the [analysis](https://hackmd.io/@jsign/vkt-tree-branch-attack) by Ignacio Hagopian (note: a work-in-progress document that may change), which goes into more detail on the actual Verkle trie construction and why this matters.
We assume here that the reader is somewhat familiar with this.
The purpose of this document is to extend the above analysis to multi-target attacks and try to estimate their complexity.
**The TL;DR is as follows:**
Attacks that create lots of storage at depth $t=14$ seem doable for a dedicated attacker who is ready to spend millions of dollars. Consequently, **clients need to handle accesses at a depth of** $t=14$ **at the bare minimum**. A higher depth that clients should handle comfortably, say $t=16$, is recommended.
### Verkle trie hashing function
Let us take a quick look at how the trie works: we have a hash function `get_tree_key` whose output determines where something is stored in the trie. Notably, we have
`y = get_tree_key(x)`
where `x` consists of the Ethereum address and the tree index (with special handling of the last byte of the latter).
`y` is a 256-bit output. The trie is 256-adic, so we work on the byte level and treat outputs as a sequence of 32 bytes.
For two 32-byte values $y,y'$, let us denote by $BD(y,y')$, with $0\leq BD(y,y') \leq 32$, their *branching distance*: the number of shared prefix bytes.
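As a minimal sketch of this definition, assuming `y` and `y_prime` are two such 32-byte tree keys:

```python
def branching_distance(y: bytes, y_prime: bytes) -> int:
    """Number of shared prefix bytes of two 32-byte tree keys."""
    assert len(y) == 32 and len(y_prime) == 32
    for i in range(32):
        if y[i] != y_prime[i]:
            return i
    return 32  # full collision (assumed not to occur, see footnote)
```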
Data for $x_{target}$ will be stored at depth $\geq t$ in the Verkle trie, iff [^SpecialTop] there are other values $x_1,\ldots,x_t$ for which data is stored in the trie with
$BD(y_1,y_{target}) < BD(y_2,y_{target}) < \ldots < BD(y_t,y_{target})$
where $y_i := \mathrm{get\_tree\_key} (x_i)$
Notably, $x_{target}$ gets stored in a path determined by $y_{target}$ and then each of these $x_i$'s branches off from that path at a different level.
So an attacker who wants to attack a given target $x_{target}$ will *grind* values for $x_i$'s until values with a sufficiently long hash-prefix collision are found. In practice, we only care about the case $BD(y_i, y_{target}) = i-1$ for our analysis, as this is the easiest for an attacker.
For us, it is sufficient that the attacker has enough freedom in choosing the values $x_i$, and we will take no further heed of how these values relate to Ethereum addresses and tree indices.
For the rest of the analysis, we will treat `get_tree_key` essentially as a random oracle as far as its input-output behaviour is concerned.
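To make the attack shape concrete, here is a toy single-target grinding loop in the random-oracle model. Truncated SHA-256 stands in for `get_tree_key` here (it is *not* the real hash, just a stand-in with the assumed input-output behaviour), and the depth $t$ is kept tiny so the sketch runs in seconds:

```python
import hashlib

# Toy single-target grinding loop. SHA-256 stands in for get_tree_key
# as a random oracle; this is NOT the real Pedersen-based hash.
def H(x: int) -> bytes:
    return hashlib.sha256(x.to_bytes(8, "big")).digest()

def bd(a: bytes, b: bytes) -> int:
    """Branching distance: number of shared prefix bytes."""
    return next((i for i in range(32) if a[i] != b[i]), 32)

t = 3                                # tiny depth so the demo is fast
x_target = 0
y_target = H(x_target)

branch = {}                          # i -> x_i with BD(y_i, y_target) = i - 1
x = 1
while len(branch) < t:
    d = bd(H(x), y_target)
    if d + 1 <= t and d + 1 not in branch:
        branch[d + 1] = x            # found an x_i for i = d + 1
    x += 1
print(branch)                        # these x_i force depth >= t for x_target
```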
#### Cost of a single hash
What is important for practical considerations is the cost of evaluating `get_tree_key` itself. For the Ethereum Verkle trie, this is essentially a multi-exponentiation in the Bandersnatch elliptic curve group. Note, however, that an adversary who just tries candidate $x_i$'s until suitable ones are found can actually iterate over *related* $x_i$ rather than purely random ones. If $x_i$ and $x_i'$ only differ in the lowest 128-bit word by $\pm 1$, we can compute (including relevant intermediate values) $y_i' = \mathrm{get\_tree\_key}(x_i')$ from (intermediate values calculated for) $y_i = \mathrm{get\_tree\_key}(x_i)$ by a single elliptic curve group operation rather than a multi-exponentiation[^division].
So the (amortized) unit cost for a single evaluation of `get_tree_key` is essentially a single group operation, where one of the operands is fixed. This computation requires no precomputation tables stored in (processor-local) memory and is amenable to massive parallelization with ASICs.
In fact, the type of ASIC one would use in the grinding attack described below very strongly resembles the type of ASICs one would use to solve the ECDLP on the group using the attack due to van Oorschot and Wiener[^OW].
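To illustrate the incremental trick described above, the following toy sketch replaces Bandersnatch by a multiplicative group modulo a prime, so that one modular multiplication plays the role of one group operation. The modulus and generator are arbitrary illustration values, not parameters of the actual scheme:

```python
# Toy analogue of incremental grinding: a multiplicative group mod p
# stands in for the Bandersnatch group, and one modular multiplication
# stands in for one group operation. Parameters are illustration values.
p = 2**255 - 19           # any large prime works for the illustration
g = 5                     # fixed "base point"

def hash_full(x: int) -> int:
    # the expensive path: a full exponentiation ("multi-exponentiation")
    return pow(g, x, p)

x = 123456789
y = hash_full(x)          # pay the full cost once
for _ in range(10):       # grinding loop: step x by +1
    x += 1
    y = (y * g) % p       # a single group op instead of a full pow()
    assert y == hash_full(x)   # sanity check against the expensive path
```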
[^OW]: van Oorschot, Paul C., and Wiener, Michael J. 1999. Parallel Collision Search with Cryptanalytic Applications. Journal of Cryptology, 12(1), 1–28.
### Multi-Target attacks
In this document, we consider the case where the attacker has a given *set* of targets $X = \{x_1',\ldots, x_{S}'\}$ of size $S$. The attacker wants to find appropriate prefix collisions for a subset of size $s$ of them: for $s$ targets $x'$ among the $S$ many, we want to find a set $x_1,\ldots,x_t$ as described above.
This attack scenario actually subsumes several attacks:
The set $X$ can consist of honest users' addresses and trie keys, or $X$ can be chosen by the adversary themself. The latter corresponds (with $s=1$) to a birthday-paradox attack on a $t$-byte prefix of `get_tree_key`.
Attacking honest users' addresses+trie keys makes accesses to those keys more expensive.
For adversarially chosen $X$, this is still an issue if access cost is not part of gas price, because an adversary can just put their own keys into the trie and intentionally trigger accesses. This then is a potential DOS vector due to gas mispricing.
### Harder than collision
It should be noted that the problem we want to solve here is actually harder than a hash collision attack: For a given target $x'$ with $y'= \mathrm{get\_tree\_key}(x')$, we need to find $x_1,\ldots, x_t$ with $y_i := \mathrm{get\_tree\_key}(x_i)$ such that $BD(y_i, y') = i-1$. Clearly, the longer the shared prefix, the harder it is to find such $x_i$'s. So the hardest part is actually finding $x_t$. If we look at a single target and want a good, constant success probability $p=1/2$, then by the time we have found $x_t$, we will have found (on average) about $256\cdot p$ many good $x_{t-1}$'s, $256^2\cdot p$ many good $x_{t-2}$'s and so on. So the attack is completely dominated by finding $x_t$.
This picture changes drastically once $P := s/S$ is small, i.e. if the attacker is content with succeeding on only a small fraction of targets (remember that those targets might be attacker-generated): In this situation, to be overall successful, the success probability $p$ to find $x_t$ for any *given* $x'$ can be small. But then the probability that we find a good $x_{t-1}$ for the *same* $x'$ along the way is only $256p$, which may be significantly smaller than 1.
In fact, when `get_tree_key` is treated as a random oracle, the best (minimizing the number of `get_tree_key` evaluations) an adversary can do is grind $T$ many $x_i$'s until sets $\{x_1,\ldots, x_t\}$ with appropriate prefix collisions are found for $s$ many of the given targets.
For any given $x'$, this then means that the expected number of $x_i$'s with $BD(y_i,y') = i-1$ that we find is given approximately[^approx] by $P_i := \frac{T}{256^{i-1}}$.
If $P_i \geq 1$, we expect to find many solutions and we take them for granted. If $P_i < 1$, we treat this as a success probability and estimate the overall success probability for a given $x'$ by
$$P := \prod_{i \,:\, P_i < 1} {P_i} = \prod_{i = 1 + \lceil \log_{256}T \rceil}^t \frac{T}{256^{i-1}}$$
We then have $s \approx P \cdot S$.
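A quick numeric sanity check of this estimate (a sketch that just evaluates the product above in floating point; the example values $T = 256^{t-4}$ and $S = 2^{80}$ reappear below):

```python
import math

def success_probability(T: float, t: int) -> float:
    """Per-target success probability P = prod_{i=1..t} min(1, T / 256^(i-1))."""
    P = 1.0
    for i in range(1, t + 1):
        P *= min(1.0, T / 256.0 ** (i - 1))
    return P

t = 14
T = 256.0 ** (t - 4)                  # T = 256^(t-4) as in the example below
P = success_probability(T, t)
print(math.log2(P))                   # -48.0, i.e. P = 2^-48
S = 2.0 ** 80
print(math.log2(P * S))               # 32.0, i.e. s ~ 2^32 expected successes
```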
#### Actual Attack Design
The advantage for the attacker of having many targets $S$ is that one can plausibly consider an attack successful with $s\geq 1$ even if the success probability per target is low. However, as we have seen above, for a low success probability per target, the number of relevant factors in the product $P^{-1} = \prod_{i = 1 + \lceil \log_{256}T \rceil}^t \frac{256^{i-1}}{T}$ grows.
This is a form of time-memory tradeoff. To have (at least) a single success $s=1$, the adversary needs $S \geq 1/P$. Since we need to store at least $P^{-1}$ targets, $P$ determines (a lower bound for) the memory consumption of the algorithm. Decreasing $P$ allows increasing $S$ while keeping $s$ constant (increasing memory). This then reduces $T$ or gives a larger $t$.
For attacker-chosen $S$ and $T$ between, say, $2^{80}$ and $2^{128}$, it turns out that the number of factors in the product that are "worthwhile" is around 2--3.
For example, with 3 factors $>1$ (and one factor $=1$), we have $P = \frac{1}{256\cdot 256^2\cdot 256^3} = 2^{-48}$ and $T = 256^{t-4}$, so $t= 4+ \log_{256}T$. An additional factor $\frac{1}{256^4}$ would reduce $P$ to $2^{-80}$.
For comparison, the attack against a single target with $S=s= P = 1$ has $t=1+\log_{256} T$.
We are not forced to set $P^{-1}$ to a power of $256$, so we can use intermediate values. The following table gives some reasonable choices of $P$.
| Computation of $P$ | $P$ | $t$ |
| ------------------ | ----- | -------------------- |
| 1 | $2^0$ | $t = 1+ \log_{256}T$ |
| $\frac{1}{2^1}$ | $2^{-1}$ | $t= 1\tfrac{1}{8} + \log_{256}T$ |
| $\frac{1}{2^2}$ | $2^{-2}$ | $t= 1\tfrac{2}{8} + \log_{256}T$|
| $\frac{1}{2^3}$ | $2^{-3}$ | $t= 1\tfrac{3}{8} + \log_{256}T$|
| $\frac{1}{2^4}$ | $2^{-4}$ | $t= 1\tfrac{4}{8} + \log_{256}T$|
| $\frac{1}{2^5}$ | $2^{-5}$ | $t= 1\tfrac{5}{8} + \log_{256}T$|
| $\frac{1}{2^6}$ | $2^{-6}$ | $t= 1\tfrac{6}{8} + \log_{256}T$|
| $\frac{1}{2^7}$ | $2^{-7}$ | $t= 1\tfrac{7}{8} + \log_{256}T$|
| $\frac{1}{2^8}$ | $2^{-8}$ | $t= 2 + \log_{256}T$|
| $\frac{1}{2^9}\cdot \frac{1}{2^1}$ | $2^{-10}$ | $t= 2\tfrac{1}{8} + \log_{256}T$|
| $\frac{1}{2^{10}}\cdot \frac{1}{2^{2}}$ | $2^{-12}$ | $t= 2\tfrac{2}{8} + \log_{256}T$|
| $\frac{1}{2^{11}}\cdot \frac{1}{2^{3}}$ | $2^{-14}$ | $t= 2\tfrac{3}{8} + \log_{256}T$|
| $\frac{1}{2^{12}}\cdot \frac{1}{2^{4}}$ | $2^{-16}$ | $t= 2\tfrac{4}{8} + \log_{256}T$|
| $\frac{1}{2^{13}}\cdot \frac{1}{2^{5}}$ | $2^{-18}$ | $t= 2\tfrac{5}{8} + \log_{256}T$|
| $\frac{1}{2^{14}}\cdot \frac{1}{2^{6}}$ | $2^{-20}$ | $t= 2\tfrac{6}{8} + \log_{256}T$|
| $\frac{1}{2^{15}}\cdot \frac{1}{2^{7}}$ | $2^{-22}$ | $t= 2\tfrac{7}{8} + \log_{256}T$|
| $\frac{1}{2^{16}}\cdot \frac{1}{2^{8}}$ | $2^{-24}$ | $t= 3 + \log_{256}T$|
| $\frac{1}{2^{17}}\cdot \frac{1}{2^{9}}\cdot \frac{1}{2^{1}}$ | $2^{-27}$ | $t= 3\tfrac{1}{8} + \log_{256}T$|
| $\frac{1}{2^{18}}\cdot \frac{1}{2^{10}}\cdot \frac{1}{2^{2}}$ | $2^{-30}$ | $t= 3\tfrac{2}{8} + \log_{256}T$|
| $\frac{1}{2^{19}}\cdot \frac{1}{2^{11}}\cdot \frac{1}{2^{3}}$ | $2^{-33}$ | $t= 3\tfrac{3}{8} + \log_{256}T$|
| $\frac{1}{2^{20}}\cdot \frac{1}{2^{12}}\cdot \frac{1}{2^{4}}$ | $2^{-36}$ | $t= 3\tfrac{4}{8} + \log_{256}T$|
| $\frac{1}{2^{21}}\cdot \frac{1}{2^{13}}\cdot \frac{1}{2^{5}}$ | $2^{-39}$ | $t= 3\tfrac{5}{8} + \log_{256}T$|
| $\frac{1}{2^{22}}\cdot \frac{1}{2^{14}}\cdot \frac{1}{2^{6}}$ | $2^{-42}$ | $t= 3\tfrac{6}{8} + \log_{256}T$|
| $\frac{1}{2^{23}}\cdot \frac{1}{2^{15}}\cdot \frac{1}{2^{7}}$ | $2^{-45}$ | $t= 3\tfrac{7}{8} + \log_{256}T$|
| $\frac{1}{2^{24}}\cdot \frac{1}{2^{16}}\cdot \frac{1}{2^{8}}$ | $2^{-48}$ | $t= 4 + \log_{256}T$|
| $\frac{1}{2^{25}}\cdot \frac{1}{2^{17}}\cdot \frac{1}{2^{9}} \cdot \frac{1}{2^{1}}$ | $2^{-52}$ | $t= 4\tfrac{1}{8} + \log_{256}T$|
| $\frac{1}{2^{26}}\cdot \frac{1}{2^{18}}\cdot \frac{1}{2^{10}} \cdot \frac{1}{2^{2}}$ | $2^{-56}$ | $t= 4\tfrac{2}{8} + \log_{256}T$|
| $\ldots$ | $\ldots$ | $\ldots$ |
As we can see, we need to increase $P^{-1}$ by a factor of $2^{\#\text{factors in the product}}$ (i.e. memory) to gain a factor of two on $T$ (i.e. on time). For plausible attack budgets, this seems not cost-effective (compared to increasing $T$ by parallelizing the work more) far beyond $2^{50}$. The optimum is likely much lower.
Additionally, with a lower $P$, since we need more targets $S$, we also need more time to actually check a grinded $x_i$ against the candidate targets. With appropriate data structures (and if we check all candidates), this check takes $O(\log S)$ random memory accesses. In addition, each memory access gets more expensive as memory grows.
If we are willing to spend time $T>2^{80}$ anyway, but the needed number of targets per success is $P^{-1} < 2^{50}$, then in the attacker-chosen target scenario with $S=2^{80}$, we get a large number $>2^{30}$ of successful attack targets. If we are happy with getting a lower number of successful targets, we can pre-filter the targets.
The pre-filtering attack works concretely as follows: Pick some $\ell > 0$.
1. Create $S$ (random) candidate targets $X$.
2. Only retain $\tilde S$ targets $\tilde X$ where the first $\ell$ bytes have a particular (arbitrary) value. We have $\tilde S = 256^{-\ell} S$.
3. For each $1 \leq i \leq \ell-1$, find an $x_i$ such that $BD(y_i, y') = i$ for $y_i = \mathrm{get\_tree\_key}(x_i)$, $y'=\mathrm{get\_tree\_key}(x')$, $x'\in \tilde X$. Note that this does not depend on the particular $x' \in \tilde X$ chosen, as all resulting $y'$ share a prefix of $\ell$ bytes.
4. Grind $T$ candidate values $x$ and compute $y=\mathrm{get\_tree\_key}(x)$.
4.1. Check if $y$ matches the chosen pattern on the first $\ell$ bytes.
4.2. If yes, check whether $BD(y,y')$ is large (i.e. $> \ell$) for any $y' = \mathrm{get\_tree\_key}(x')$ for $x'\in \tilde X$.
4.3. If yes, store it along with the prefix length $BD(y,y')$ for this $x'$.
5. Output any $x'\in\tilde{X}$ and corresponding $x_i$'s for each $x'$ where the attack is successful, which in this context means we found hash-prefix collisions against $x'$ of every length $0 \leq j \leq t-1$.
(This includes the elements $x_i$ from step 3; these may be shared among the $x'$.)
Of course, we do not need to ever store all elements from $X$; we only store $\tilde X$.
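The following toy simulation of steps 1--5 may help fix ideas. As before, truncated SHA-256 stands in for `get_tree_key`, the parameters are tiny so that it runs instantly, and the check in step 4.2 is a plain loop over $\tilde X$ rather than the $O(\log S)$ lookup structure a real attacker would use:

```python
import hashlib
from collections import defaultdict

ELL = 1          # prefix-filter length in bytes (the ell of the text)
S = 1 << 12      # number of candidate targets
T = 1 << 16      # grinding budget

def H(x: int) -> bytes:
    # SHA-256 as a random-oracle stand-in for get_tree_key (NOT the real hash)
    return hashlib.sha256(x.to_bytes(8, "big")).digest()

def bd(a: bytes, b: bytes) -> int:
    return next((i for i in range(32) if a[i] != b[i]), 32)

pattern = bytes(ELL)                      # filter on an all-zero prefix
# steps 1+2: keep only the ~256^-ELL fraction of targets matching the pattern
targets = {x: H(x) for x in range(S) if H(x)[:ELL] == pattern}

found = defaultdict(dict)                 # target x' -> {BD: candidate x}
for x in range(S, S + T):                 # step 4: grind candidates
    y = H(x)
    if y[:ELL] != pattern:                # 4.1: cheap local filter (on the ASIC)
        continue
    for xp, yp in targets.items():        # 4.2: memory lookup (linear for brevity)
        d = bd(y, yp)
        if d > ELL:
            found[xp][d] = x              # 4.3: store along with its BD

# step 5 (partial): report the target with the deepest collision found
best = max(found.items(), key=lambda kv: max(kv[1]), default=None)
print(best)
```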
The idea behind pre-filtering is that an ASIC that grinds the $x$'s can locally filter out bad candidates quickly. In this setting, step 4.1 is done on the ASIC, and 4.2 and 4.3 are done on dedicated machines with memory. This way, communication and memory accesses only need to occur for a $256^{-\ell}$ fraction of candidate $x$'s.
With even a modest value for $\ell$, the attack is massively parallelizable and the dominating cost is indeed given by $T$ many group operations. Note that the (precomputation) steps 1+2 can in fact be merged with steps 4+5, but the description above is clearer. Step 3 has negligible cost, provided $\ell$ is set such that $\tilde{S} \geq 1$ with high probability.
Observe that the success probability (before filtering) per candidate value $x$ of getting a length-$i$ hash-prefix collision with any $x'\in \tilde{X}$ is not changed for $i\geq \ell$. Of course, after the filtering, we always have a prefix collision of at least $\ell$ bytes by construction, which is why step 3 is needed.
In the setting described above with $P=2^{-48}$ and $S\geq 2^{80}$, the attacker can essentially select whether to spend the difference on a high number $s$ of successful targets or on pre-filtering $\ell$.
### Implications and Concrete Parameters
Assume an attacker has time $T$ and wants to create a sufficient number $s$ of branches at depth $t$ in the trie. Later, the attacker randomly accesses these branches for a DOS attack. Since the attacker has to actually write enough data to the trie (read/write access is costly and rate-limited), it does not make sense to look into huge values of $s$. For concreteness and to get nice values, we consider $s=2^{16}$, which is more than enough.
By the above analysis, the value of $t$ that an attacker can reach with reasonable memory requirement $P^{-1}\cdot s$ is approximately (give or take an additive $\pm 1$) given by $t \approx 4+\log_{256}T$ for $P=2^{-48}$. Changing the concrete values of $P^{-1}$ and $s$ within reasonable bounds does not change $t$ by much. It is best to use $S=T$; the value of $\ell$ is then determined by $\log_{256}(T\cdot P) - \log_{256}s$. For any reasonable $T\geq 2^{80}$, this leads to $\ell \geq 2$ (possibly much larger), which means that communication cost is negligible.
For instance, with $T=2^{128}, P=2^{-48}, \ell = 8$, we get $s=2^{16}$ and $t=20$, with an attack requiring $\approx 2^{64}$ (rarely accessed) memory. With the same $T$ and $s$, we can reduce this to $2^{40}$ memory and $t=19$.
With $T=2^{80}$, $P=2^{-48}, \ell = 2, s=2^{16}$ we get $t=14$ with $2^{64}$ memory. Again, the memory can be significantly lowered at the expense of $s$ and $T$.
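For convenience, here is a small helper that reproduces the two parameter sets above (a back-of-the-envelope sketch; the depth formula $t = 4 + \log_{256}T$ is specific to the choice $P = 2^{-48}$):

```python
def attack_parameters(log2_T: float, log2_Pinv: float, log2_s: float):
    """Rough parameters from the analysis above, assuming S = T.

    Returns (t, ell, log2 of retained targets), where
      t    = 4 + log_256(T)               (valid for log2_Pinv = 48)
      ell  = log_256(T * P) - log_256(s)
      mem  = log2(S / 256^ell)
    """
    t = 4 + log2_T / 8
    ell = (log2_T - log2_Pinv - log2_s) / 8
    log2_mem = log2_T - 8 * ell
    return t, ell, log2_mem

print(attack_parameters(128, 48, 16))   # (20.0, 8.0, 64.0), as in the text
print(attack_parameters(80, 48, 16))    # (14.0, 2.0, 64.0)
```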
As a consequence, I suggest estimating the trie depth that clients need to be able to handle (with lots of accesses to different slots) as $4+\log_{256}T$, where $T$ is the maximum number of group operations a well-funded attacker can afford to compute using ASICs. For cryptographic levels of security approximately matching the security level of the curve used ($T \approx 2^{127}$), this means $t=20$. Since this is "only" a DOS vector that we could recover from, we might not need that high a level of security. From a more economic point of view, an electricity budget of about 1 million USD corresponds to (a ballpark estimate, taking very rough numbers from bitcoin hashes and ignoring the fact that the hash operation is more expensive here) $T\approx 2^{80}$, which would give $t=14$. Note that the R&D effort to design and the cost to manufacture such ASICs might potentially be amortized with effort spent to build ASICs for the ECDLP via the attack of van Oorschot and Wiener[^OW].
It is thereby recommended that clients should be able to handle **at least** $t=14$, ideally $t=16$ or more.
[^approx]: The given formula is valid for a prefix-collision length $\geq i$ rather than $=i$.
[^SpecialTop]: Technically, the specification as written treats the top level differently, so the top level of the trie would always branch, even if the $y$'s for all entries stored in the trie had the same first byte. This is a non-issue. We also assume that no actual full collision for `get_tree_key` occurs, i.e. we don't use $x\neq x'$ with $\mathrm{get\_tree\_key}(x) = \mathrm{get\_tree\_key}(x')$.
[^division]: Technically, we also need to perform a `map_to_field` operation. This is another division in the underlying base field; we treat this as part of the group operation here, which is appropriate because projective point addition + `map_to_field` has exactly the same cost as the normal affine point addition. This is because the `map_to_field` operation can work with projective coordinates as input. Essentially, `map_to_field` can be seen as a normalization to some form of affine coordinates and taking one of the coordinates, just not the usual one (we set `y:=1` rather than `z:=1`).