# Data availability sampling in practice

If we want to move beyond relying on committee assumptions in eth2, we need to have a robust infrastructure for data availability sampling. This post attempts to discuss what this would look like in practice in the eth2 context.

For simplicity, we will assume a fraud-proof-free data availability sampling technique - either Kate commitments or Merkle proofs with a SNARKed Merkle root. Kate commitments are likely significantly more efficient.

### Recap: how does data availability sampling work?

_Note: actual implementation differs somewhat from this exposition, which optimizes for simplicity_

Take a piece of data $D$, of size $N = |D|$ chunks (each chunk is ~32 bytes). Using [Lagrange interpolation](https://brilliant.org/wiki/lagrange-interpolation/) we can compute a degree $< N$ polynomial $P$ that represents $D$ as its evaluations; that is, where $P(0) = D[0]$, $P(1) = D[1]$, $P(2) = D[2]$ and so on up to $P(N-1) = D[N-1]$.

We generate a commitment to $P$ that can be used to verify the evaluations of $P$ at the positions $[0 ... 2N-1]$ - the $N$ positions containing the data but also $N$ additional positions. You can think of this commitment as a Merkle root of all $2N$ evaluations of $P$, plus a SNARK proving that those evaluations really do all come from the same degree $< N$ polynomial. In practice, [Kate commitments](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) may work better.

Now, you may be asking, why do we encode these extra redundant $N$ evaluations of the polynomial? The answer is this: if you have $N$ evaluations, _any_ $N$ evaluations, of a degree $< N$ polynomial, you can use [Lagrange interpolation](https://brilliant.org/wiki/lagrange-interpolation/) (or potentially [faster FFT-based techniques](https://ethresear.ch/t/reed-solomon-erasure-code-recovery-in-n-log-2-n-time-with-ffts/3039)) to recover the polynomial, and hence recover all of the other evaluations. As long as at least half the evaluations are available, all of the data is available. This reduction from needing 100% to only needing 50% allows us to use random sampling to probabilistically verify that a block has been published without downloading the whole thing.

A client randomly selects $k$ indices in the range $[0, 2N-1]$, and sends a message to the network with these indices. The network responds with a proof (think of this as a Merkle proof) for each index $i$ proving what the value of $P(i)$ is. When the client sees that all of their queries have been answered, they assume that the block data really has been published. This assumption is backed by the following reasoning:

1. We assume that there are enough clients out there that fully responding to all clients requires publishing more than half the data, allowing any single honest node to recover and republish the rest of the data.
2. If less than half the data has been published (and so the block cannot be reconstructed), then it's overwhelmingly likely that at least one of the client's indices will correspond to a piece of data that has not been published, and so the client will reject the block.

### How do we implement this?

The main challenge in the current setup is that blocks in each shard are on a different subnet, so doing the sampling procedure directly would require each client to be on all subnets, which is untenable.
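Before looking at how queries get routed, it may help to make the sampling rule above concrete. The sketch below is a minimal Python illustration, not an implementation: `request_sample` and `verify_proof` are hypothetical helpers standing in for whatever transport and commitment scheme is actually used, and the sample count is illustrative rather than a spec constant. It shows the check a client runs, and why an unavailable block escapes detection with probability at most $(1/2)^k$.

```python
import random

SAMPLES_PER_BLOCK = 30  # k; illustrative value only, not a spec constant


def check_availability(commitment, num_chunks, request_sample, verify_proof):
    """Sample k of the 2N extended chunks and verify each returned proof.

    `request_sample(commitment, index)` is a hypothetical network call that
    returns a (value, proof) pair, or None if nobody answers; `verify_proof`
    checks that pair against the commitment (e.g. a Kate opening, or a Merkle
    branch under the SNARKed root).
    """
    extended_size = 2 * num_chunks
    indices = random.sample(range(extended_size), SAMPLES_PER_BLOCK)
    for index in indices:
        response = request_sample(commitment, index)
        if response is None:
            return False  # unanswered query: treat the block as unavailable
        value, proof = response
        if not verify_proof(commitment, index, value, proof):
            return False  # bad proof: treat the block as unavailable
    return True  # all k queries answered and verified


# If fewer than half of the 2N chunks were published, the chance that all k
# sampled indices happen to land on published chunks is below (1/2)**k:
print((1 / 2) ** SAMPLES_PER_BLOCK)  # ~9.3e-10 for k = 30
```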
### Approach 1: find an intermediary for each shard

A client can locate an intermediary node for each shard; that is, for each shard, it can seek out in the beacon chain subnet a node that is also on that shard. The client would then route all requests related to that shard through that node.

The main challenge with this approach is that it requires clients to have quite a lot of peers. Note that 64 peers (1 per shard) is not enough, because we need to account for the risk that any given peer is dishonest; you may need eg. 16 peers per shard (so 1024 peers total). If intermediary nodes are willing to be on multiple subnets (eg. 10 subnets each), this goes back down to ~102 peers total, but then each intermediary node's load is higher.

### Approach 2: a subnet per index

We create $N$ subnets, where $N$ is the number of chunks in a block (or we can put multiple adjacent chunks into one subnet to decrease the subnet count). A client would choose a persistent set of indices that it always searches for, and join the subnets corresponding to those indices.

The main challenge with this approach is that it makes it harder for clients to change the indices that they use, making it potentially easier to trick specific clients. However, it's not _too_ bad: clients can still continually switch to new subnets as quickly as they are able to join them, altering which indices they ask for. Another challenge is that there would need to be a procedure for rebroadcasting each chunk of each block from each shard to each subnet, which seems to imply a high minimum on the number of nodes in the network, so that each (shard, index) intersection has a corresponding node.

### Approach 3: multi-level intermediation

A client makes sure to be connected to at least ~16 aggregator nodes. Each aggregator node specializes in maintaining contact with at least one intermediary per shard. The request now gets passed:

```
client -> aggregator -> intermediaries -> shard_network
```

And then the response gets passed along the same path in reverse. This has the following benefits:

* The client now only needs to maintain a few peers
* The aggregators don't need to maintain too many peers either; they just need 1 peer per shard. If an aggregator's peer on some shard is dishonest, that is exactly why the client talks to multiple aggregators.

"Aggregator" does not need to be a specialized role; potentially, regular clients can promote themselves to aggregators if they have the p2p bandwidth and manage to find at least one intermediary per shard.

This avoids the issues with approaches (1) and (2), at the cost of increasing the number of rounds of network latency from 2 to 3. Hence it is this author's current preference.
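As a rough illustration of how approach 3 fits together, here is a sketch of the aggregator's relaying step; all class and type names here are hypothetical, intermediaries are modeled as plain callables into their shard subnets, and error handling and timeouts are omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A sample is a (value, proof) pair; an intermediary is modeled as a callable
# that forwards a chunk-index query into its shard's subnet.
Sample = Tuple[bytes, bytes]
Intermediary = Callable[[int], Optional[Sample]]  # index -> sample, or None


class Aggregator:
    """Keeps one intermediary per shard and relays client queries/responses."""

    def __init__(self, intermediaries: Dict[int, Intermediary]):
        self.intermediaries = intermediaries  # shard_id -> intermediary

    def handle_request(
        self, queries: Dict[int, List[int]]
    ) -> Dict[int, List[Optional[Sample]]]:
        # queries: shard_id -> list of chunk indices the client sampled.
        # Each index is forwarded to that shard's intermediary, which fetches
        # it from the shard subnet; None marks an unanswered query.
        responses: Dict[int, List[Optional[Sample]]] = {}
        for shard_id, indices in queries.items():
            intermediary = self.intermediaries.get(shard_id)
            if intermediary is None:
                responses[shard_id] = [None] * len(indices)
                continue
            responses[shard_id] = [intermediary(index) for index in indices]
        return responses
```

A client would send the same queries to several such aggregators and count an index as answered as soon as any of them returns a valid proof for it, which is why a single dishonest intermediary or aggregator cannot make published data look unavailable.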