# PeerDAS design progress at the interop
Some meaningful changes in the PeerDAS design have been discussed, and agreed upon, at the interop:
- Do not use the tight fork-choice in the first iteration of PeerDAS
- Increase custody requirement and subnet count
- Introduce validator custody
- Take peer sampling completely out of the critical path of consensus
In my opinion, these changes represent both a huge simplification of the design and an increase in its robustness. Let's break down what these points mean and why that is the case.
## Trailing fork-choice instead of tight fork-choice
Firstly, remember the difference between the *tight* and *trailing* fork-choice, also discussed [here](https://ethresear.ch/t/das-fork-choice/19578): with the former, we require peer sampling to be satisfied in order to vote for a new block proposal, whereas with the latter we only require it in order to vote for older blocks. In other words, with the trailing fork-choice we only start taking peer sampling into account when determining the availability of blocks after some time, e.g., one or multiple slots, with the goal of allowing enough time for it to happen without getting in the way of voting. Essentially, the trailing fork-choice attempts to take peer sampling out of the critical path of consensus, making it harder for it to cause liveness failures.
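To make the difference concrete, here is a minimal sketch in spec-style Python of when a vote would have to wait for peer sampling under each variant. The names (`TRAILING_SLOTS`, `vote_requires_peer_sampling`) are hypothetical illustrations of the idea above, not spec code.
```python
TRAILING_SLOTS = 1  # hypothetical trailing period, e.g., one slot

def vote_requires_peer_sampling(block_slot: int, current_slot: int, tight: bool) -> bool:
    """Whether a vote for a block proposed at `block_slot` has to wait for peer
    sampling to be satisfied, from the perspective of a voter at `current_slot`."""
    if tight:
        # Tight fork-choice: peer sampling gates every vote, including votes
        # for the block proposed in the current slot.
        return True
    # Trailing fork-choice: peer sampling only gates votes for blocks that are
    # at least TRAILING_SLOTS old, leaving the freshest proposals untouched.
    return current_slot - block_slot >= TRAILING_SLOTS
```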
At the very least, it seems prudent to start by using a fork-choice where peer sampling trails by one slot, because the tight fork-choice has very strict timing requirements, and many ways in which they might not be satisfied, whether due to network failures or to malicious behavior. With the tight fork-choice, if any one of the peers you have requested a sample from does not respond before the attestation deadline, you're unable to vote for the latest proposal. On the other hand, if the timing constraints are sufficiently relaxed, it is ok for a few queries to time out, as, for each sample, you only need one query to eventually succeed.
There are ways to make peer sampling more robust even on tighter timelines, such as [LossyDAS](https://ethresear.ch/t/lossydas-lossy-incremental-and-diagonal-sampling-for-data-availability/18963), which introduces some tolerance of sampling failures while keeping the same soundness error, by increasing the number of sampling queries and requiring more of them to succeed. For example, instead of requiring 16 out of 16 queries to succeed, we can require 19 out of 20 successes, or 21 out of 23. Still, there is much work to do before we can confidently say that peer sampling achieves the same level of reliability under tight timing constraints as mesh diffusion, which is currently what we rely on for the critical path of consensus. We can keep working on this and test out different strategies in production once peer sampling is already on mainnet, but for the first iteration it seems much safer to stick to the trailing fork-choice.
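As a rough illustration of the LossyDAS trade-off, the back-of-the-envelope below bounds the soundness error, i.e., the chance that an unavailable blob (with at most half of the extended columns released, so reconstruction is impossible) still passes the check. It uses a crude binomial model in which each query independently succeeds with probability at most 1/2; the actual LossyDAS analysis samples without replacement, so the exact numbers differ, but the qualitative point stands: tolerating a few failures costs only a modest increase in the number of queries.
```python
from math import comb

def soundness_error(n_queries: int, n_required: int, p_avail: float = 0.5) -> float:
    """Upper bound on the probability that an unavailable blob still passes a check
    requiring at least `n_required` of `n_queries` random samples to succeed,
    assuming each query independently succeeds with probability `p_avail`."""
    return sum(
        comb(n_queries, k) * p_avail**k * (1 - p_avail)**(n_queries - k)
        for k in range(n_required, n_queries + 1)
    )

print(soundness_error(16, 16))  # ~1.5e-5: all 16 queries must succeed
print(soundness_error(20, 19))  # ~2.0e-5: one failure tolerated
print(soundness_error(23, 21))  # ~3.3e-5: two failures tolerated
```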
## Increase custody requirement and subnet count
Once we have agreed to go with the trailing fork-choice, even with the smallest possible trailing period of one slot, there are many downstream consequences to other design decisions. Firstly, [the trailing fork-choice exacerbates some existing ex-ante reorg attacks](https://notes.ethereum.org/P1iDee8lTwyAtHpZwd8LMw?view#With-the-trailing-fork-choice), *unless we set a sufficiently high custody requirement*. The intuition here is that votes cast before doing peer sampling are "insecure", in the sense that they might well be for an unavailable block, and this opens us up to fork-choice attacks. That is, unless we have some other way of gauging availability before doing peer sampling, which gives us sufficient guarantees.
This other way is to consider something available as long as we can retrieve everything that we are custodying. Another way to think about it, given that custody happens through subnets, is that we consider something available before doing peer sampling as long as *subnet sampling* is satisfied. For this to give us meaningful security guarantees, we have to sample enough, meaning we have to have a sufficiently high custody requirement. With that, the trailing fork-choice becomes essentially a hybrid of [SubnetDAS](https://ethresear.ch/t/subnetdas-an-intermediate-das-approach/17169) and PeerDAS, utilizing the former during the trailing period and the latter afterwards.
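A minimal sketch of this pre-sampling availability check (the names are hypothetical): during the trailing period, a node treats a block's data as available as long as it has retrieved every column in its custody set from the corresponding subnets.
```python
def is_available_pre_sampling(custody_columns: set[int], retrieved_columns: set[int]) -> bool:
    # Subnet sampling as an availability check: everything the node custodies
    # must have arrived on its subnets for the block to count as available
    # during the trailing period, before peer sampling is taken into account.
    return custody_columns <= retrieved_columns
```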
Once we have increased the minimum custody requirement (say to 8, which provides sufficient guarantees), it also makes sense to increase the number of subnets so as to keep the ratio of data custodied by each node to total data low. For example, with a custody requirement of 8 and 128 subnets, each node would download 1/16 of the extended data, or 1/8 of the original data. This means that we can increase the blob count up to 8x, without increasing the bandwidth required by each node (here ignoring what is required by peer sampling, which is much less than what is required for custody, since it doesn't suffer from gossip amplification). Were we to keep the number of subnets at 32 as in the first version of the PeerDAS spec, each node would instead be downloading 1/2 of the original data, and we would hardly get any scalability benefits. We have agreed to increase the number of subnets to at least 64, and possibly even to 128.
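The arithmetic behind these numbers, as a small sketch (the helper name is purely illustrative):
```python
def per_node_load(custody_subnets: int, total_subnets: int):
    """Fraction of the extended and of the original data each node downloads,
    assuming columns are spread evenly across subnets and the erasure-coded
    extension doubles the data."""
    extended_fraction = custody_subnets / total_subnets
    original_fraction = 2 * extended_fraction  # extended data is 2x the original
    return extended_fraction, original_fraction

# Custody requirement 8 with 128 subnets: 1/16 of the extended data,
# i.e., 1/8 of the original data, so roughly 8x blob-count headroom.
print(per_node_load(8, 128))  # (0.0625, 0.125)

# Same custody requirement with only 32 subnets: 1/2 of the original data,
# leaving hardly any headroom to scale the blob count.
print(per_node_load(8, 32))   # (0.25, 0.5)
```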
## Introducing validator custody
Setting a higher custody requirement, sufficient to give meaningful security guarantees before peer sampling is performed, is a decision that's entirely centered around attesting and proposing, i.e., validator duties. *It is not actually beneficial for full nodes*, which do not have such duties. Therefore, we introduce validator custody, i.e., we make nodes with validators attached custody more data by default than full nodes do. We can then also set the general custody requirement to be lower, as this is only driven by the need to provide a backbone for the subnets and a minimum level of data diffusion throughout the network, as a baseline for peer sampling. In particular, we can do the following:
- Each full node participates in only 1 out of 32 of the subnets (e.g., set the custody to 4 if we go with 128 subnets), and therefore only downloads 1/16 of the original data.
- The baseline custody requirement for a node with at least one validator attached is 6. We also assign additional custody beyond the minimum 6, based on the total balance of all validators that a node has attached. For example, a node could participate in one extra subnet for every 16 ETH of balance, mapping to two extra subnets per pre-MaxEB validator (see the sketch after this list). This has two main implications:
    - A node with at least one validator (above the minimum activation balance) attached custodies a minimum of 8 subnets, which ensures a good minimum level of security when performing validator duties.
    - With 128 subnets, a node with ~2000 ETH would attempt to download all of the columns, and thus be able to reconstruct whenever it is at all possible to do so.
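A sketch of the custody rule just described, using the constants from the bullet points above; the function name and the exact balance-to-subnet mapping are illustrative rather than spec-final.
```python
SUBNET_COUNT = 128
FULL_NODE_CUSTODY = 4          # full node without validators: 1/32 of the subnets
VALIDATOR_BASE_CUSTODY = 6     # baseline for a node with at least one validator
BALANCE_PER_EXTRA_SUBNET = 16  # ETH of attached balance per additional subnet

def custody_subnet_count(total_attached_balance_eth: int, has_validators: bool) -> int:
    """Number of subnets a node custodies under this hypothetical rule."""
    if not has_validators:
        return FULL_NODE_CUSTODY
    extra = total_attached_balance_eth // BALANCE_PER_EXTRA_SUBNET
    return min(SUBNET_COUNT, VALIDATOR_BASE_CUSTODY + extra)

print(custody_subnet_count(0, False))    # 4: full node without validators
print(custody_subnet_count(32, True))    # 8: one pre-MaxEB validator
print(custody_subnet_count(2000, True))  # 128: ~2000 ETH custodies everything (supernode)
```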
These choices give us a few benefits. Firstly, while full nodes without validators are still contributing to the network by doing some custody, they are cheaper to run than if we were to set their minimum level of custody based on what is required for validator duties. Moreover, scaling the validator custody with the number of attached validators lets us take advantage of the fact that we do not have one million independent validators, but rather 10 to 20 thousand nodes, many of which have 100s or 1000s of validators attached. Such nodes would by default be supernodes, downloading all of the data, which seems entirely appropriate for a node whose associated stake is worth 10s or 100s of millions of dollars. For these nodes, availability essentially works as in 4844, because they do not do any sampling: either they can get all of the data, or they do not consider it available. As far as they are concerned, the system can actually even be a bit more robust than with 4844, because they can reconstruct the whole data as soon as any 50% is available, so that localized failures or slowdowns of some subnets do not affect them. *Since such nodes hold a vast majority of the stake, neither subnet nor peer sampling will have much of an impact on consensus in practice*, making the system nearly as robust as what we have today.
## Taking peer sampling out of the critical path
By adopting the trailing fork-choice, we have already deferred peer sampling and prevented it from affecting the most time critical consensus tasks. *In fact, there's very little reason not to go further and remove almost any influence it has on consensus.* In my opinion, this represents a huge simplification and a major de-risking of the move from 4844 to PeerDAS.
### Using peer sampling for transaction confirmation, not for consensus
Let's first consider a system where validators only use their custody check to determine whether something is available, i.e., SubnetDAS. If there is an honest majority of validators, everything works out, because all but a small percentage of honest validators will never vote for an unavailable block, as their custody check would fail. Consensus can stay safe and live (with the right fork-choice design, see [here](https://notes.ethereum.org/P1iDee8lTwyAtHpZwd8LMw?view)) and we can be sure that no unavailable block will ever be finalized.
On the other hand, what happens if we don't have an honest majority? Then, we of course don't have any security guarantees about consensus. Still, we want to guarantee that full nodes do not accept unavailable blocks, just like they do not accept invalid blocks. *This is precisely where peer sampling finds its value*, because it is cheaper than subnet sampling (no gossip amplification, so we can do more sampling) and makes it a bit harder to target specific nodes (the queries are not all public, as they are with subnet sampling). We can reintroduce peer sampling by having it just be part of transaction confirmation, without affecting the fork-choice, and thus neither attestations nor proposing. In other words, *without it affecting consensus at all*. Furthermore, we actually only need to require peer sampling for finalized blocks, because confirmation of non-finalized blocks relies on an honest majority anyway.
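A minimal sketch of such a confirmation rule, assuming hypothetical predicates for the custody check, the peer sampling check, and the underlying (honest-majority based) consensus confirmation:
```python
def is_confirmed(consensus_confirmed: bool, finalized: bool,
                 custody_check_ok: bool, peer_sampling_ok: bool) -> bool:
    """Whether a full node treats a block's transactions as confirmed."""
    # The custody (subnet sampling) check always has to pass, together with the
    # usual consensus confirmation rule.
    if not (consensus_confirmed and custody_check_ok):
        return False
    # Peer sampling is only required once the block is finalized; confirmation
    # of non-finalized blocks relies on an honest majority anyway.
    return peer_sampling_ok if finalized else True
```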
### Reintroducing peer sampling in a limited capacity
The system we have described is very appealing, because full nodes get all the security guarantees of peer sampling, while consensus is completely untouched by it. In particular, liveness of consensus only requires subnets to work reliably. Still, it might make sense to employ peer sampling in an extremely limited capacity within the consensus protocol, namely for justification. That is, a node would not consider a checkpoint to be justified unless it passes the peer sampling check. Luckily, justification takes several minutes, so the timing constraints here are very loose.
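A minimal sketch of this justification gate (hypothetical names; the point is only that the peer sampling check is combined with the usual supermajority condition, on a timescale of minutes):
```python
def is_justified_locally(has_supermajority_link: bool, peer_sampling_ok: bool) -> bool:
    # A checkpoint with a supermajority link is only treated as justified by
    # this node once its peer sampling check has also passed.
    return has_supermajority_link and peer_sampling_ok
```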
By gating justification on peer sampling in this way, we prevent an honest node from voting with a source checkpoint which is not available, even if their custody check is satisfied. For example, consider the scenario where there's a malicious supermajority which justifies an unavailable block, and also targets some validators by releasing only the columns they are custodying. Such validators would be tricked into voting with this unavailable justified checkpoint as source, and would subsequently be unable to switch to a minority fork without slashing themselves. They would instead be stuck on the malicious fork, and be inactivity leaked on any other.