# Gossip data sampling idea

For context, the write-up by Vitalik describing the problem: [Data availability sampling in practice](https://notes.ethereum.org/@vbuterin/r1v8VCULP)

This write-up is a variant that expands on approach 2, by @protolambda.

This is all experimental, there may be holes/weaknesses, feedback welcome.

TLDR of context:

- Shard proposers publish block data somewhere, optionally split somehow
- Listeners want to make sure the data is available
- Ignore the attestation data bit(s) and availability proofs for now.

General idea is:

- Request any tiny chunk
- Enough random distinct tiny chunks can reconstruct the original data, like error correction
- The correctness of these pieces can be shown individually
- Request `k / (2N + 1)` of the error-correction data, `N` being the chunk count matching the original input.
- Random sampling + network effect -> enough to trust data is available with `k` requests

## Desiderata

- Low latency, sub-slot time preferably
- Stable; the current attestation subnets are at the brink of requiring too much subnet movement
- Random choices for sampling, less predictable is better
- No new network stack components

## General approaches

- DHT based: slow, stable, pull-based
- Gossip based: fast, difficult, pubsub-based (push, but gossip, not direct)
- RPC streams: slow, peering / fan-out complexity, inventing a new system

## The idea

Improve upon approach 2 of the original write-up, chosen because:

- Gossip is fast
- Gossip can be stable (the current approach 2 is not so much)
- Random sampling is possible (try to exploit the gossip approach here)
- Gossipsub is already widely integrated in Eth2

Although the other approaches / options have better sampling, this approach seems more viable, and we can try to improve sampling still.

### How

Like approach 2, the chunks are mapped to gossip subnets, and reach the validators.

Different now: try to move work from subscriber to publisher.

Additionally, we try shuffling the mapping between chunk index and subnet index for each round. This doesn't add much to randomness, but is a start.

To repeatedly and quickly get `k` random samples, you can now stay on a random set of `k` subnets. Each subnet processes a new random chunk index (or a subset of all indices) each round.

A proposer needs to put the chunks on each subnet, but this is a one-off task, which can be improved with hierarchical nets.

Another way would be to do a fan-out step: to distribute data to all chunk subnets (`M`), first distribute it to all connected peers, which then put it on their chunk subnets. The chunks can be content-addressed like attestations, so duplicates don't hurt. Gossipsub already has similar fanout functionality (push to all peers, even if outside of the joined topic mesh).

Note that compared to approach 2, *this shifts most of the work to proposers*. Which is good, since the publishing task should be more flexible than the subscriber work, and there are many more subscribers than publishers that need to run.

The larger the subnet count, the better the sampling would be. Each subnet is what counts towards the random sample taken by validators. The chunks are simply split so that `total_chunk_count / subnet_count` chunk indices map to each subnet.
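As a rough illustration (not part of the proposal; all names and parameter values are placeholders), a per-round shuffled mapping from chunk index to subnet index could look like the following, seeded e.g. by a hash of the shard block:

```python
# Hypothetical sketch of the chunk_index <> subnet mapping with a per-round shuffle.
# CHUNK_SUBNET_COUNT, the seed source, and the shuffle itself are placeholders, not spec'd.
from hashlib import sha256
from typing import List

CHUNK_SUBNET_COUNT = 64  # assumed number of chunk subnets (M)


def shuffled_subnet_order(seed: bytes) -> List[int]:
    # Deterministic pseudo-random permutation of subnet indices, derived from the round seed.
    # Sorting by a hash is just the simplest illustration of an agreed-upon shuffle.
    return sorted(
        range(CHUNK_SUBNET_COUNT),
        key=lambda i: sha256(seed + i.to_bytes(8, 'little')).digest(),
    )


def subnet_for_chunk(chunk_index: int, total_chunk_count: int, seed: bytes) -> int:
    # Chunks are split evenly: total_chunk_count / CHUNK_SUBNET_COUNT chunk indices per subnet,
    # and the subnet order is then shuffled per round.
    chunks_per_subnet = max(1, total_chunk_count // CHUNK_SUBNET_COUNT)
    plain_subnet = (chunk_index // chunks_per_subnet) % CHUNK_SUBNET_COUNT
    return shuffled_subnet_order(seed)[plain_subnet]
```

Any proper shuffle (e.g. the swap-or-not shuffle already used for validator shuffling) could replace the hash-sort; the point is only that the `chunk_index <> subnet` mapping changes with each round's seed.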
To avoid DoS and to validate chunks in gossip, either:

- the proof material for validity of chunks is made globally available
- the proof material is added to the chunk gossip messages, if small enough

#### Weaknesses

Now the weaknesses in sampling relate to correlation and predictability:

Correlation: by staying on net `i`, you get the same random series of chunk indices as others on `i`.

Predictability: by staying on net `i`, and with some proposer knowing your presence in advance, they can try to only publish on the subnets you are on, and omit the rest.

The add-on of shuffling the `subnet <> chunk_index` mapping really only helps to detach certain chunk indices from being stuck to the same subnet forever. I think there is some marginal value to this; not every subnet may be the same. Not every node has the exact same sequence anyway (overlapping, but different start and end), and always validating the same chunk indices (and the same shard) seems worse.

#### Mitigating the weaknesses

As a group, the honest validators should already be safe: they each participate in their own selection of random subnets, and omitting some chunks from a validator would mean not publishing them on the corresponding subnets at all; otherwise the subnet should propagate the missing chunks.

So the concern really is tricking individual validators, and getting them to vote for blocks with missing chunks. To mitigate this, validators can still join some subnets randomly, but just part of the time. By joining a subnet randomly (with local randomness, not predictable to an attacker) there is a greater chance to get on a network with a missing chunk.

The error correction redundancy lowers the number of subnets that are necessary to trust the random sampling.

Still open for more ideas on how to increase sampling here.

#### Some familiarity, with twists

Lots of ideas similar to attestation nets, but used differently:

**Existing:** Being subscribed to a few random attestation subnets (a.k.a. the attestation subnet backbone)
**Here:** The default; subscribing to `k` subnets, easily able to do the work as listener. Note: should be more stable.

**Existing:** Rotating backbone subnets randomly on some longer interval
**Here:** Useful to increase security with little effort, being more resistant against missing chunks

**Existing:** Joining unknown subnets on shorter notice, for simple attestation work
**Here:** Some extra randomness; good to join some subnet randomly on a shorter timescale, to make predictability harder

**Existing:** Attestation subnet bits in the ENR and metadata to share where you are.
**Here:** It literally doesn't matter who you peer with for random sampling, as long as it is random, and new peers are able to join. Enough to just share "I'm on random subnets" in the metadata, maybe with a counter of how many subnets the peer is on, TBD. Or maybe everyone just shares the details of a few subnets they are subscribed to longer-term. Leaking just a few of many shouldn't matter, but can really help bootstrap new subscribers with their own random picks. This could be as small as a few bytes describing a few subnet indices.

**Existing:** The aggregate-and-proof subnet is useful for a gated global network.

**Existing:** Like a DHT, put some content in some random place to retrieve it from.
**Here:** Content is not hashed to decide on a location, but it is distributed (randomly or not) between all subnets. Another seed for randomness may work better, but after that step, the gossip messages are content-addressed.
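To make the subscriber behaviour above more concrete, here is a hypothetical sketch (names and parameter values are placeholders, nothing here is specified) of keeping `k` stable random subnets, briefly joining `q` extra ones using local randomness, and leaking only a few indices in metadata:

```python
# Hypothetical subscriber-side sketch: keep k long-term random chunk subnets, occasionally
# join q extra subnets on a short timescale, and only advertise a few indices to peers.
# All constants and function names are placeholders for illustration.
import random
from typing import List, Set

CHUNK_SUBNET_COUNT = 64  # assumed number of chunk subnets
K_LONG_TERM = 4          # "k": stable random subnets, rotated slowly (epochs)
Q_SHORT_TERM = 2         # "q": extra subnets joined briefly (slots), unpredictable to others
LEAK_COUNT = 2           # how many long-term subnet indices to share in metadata


def pick_long_term_subnets(rng: random.Random) -> Set[int]:
    # Slow-rotating backbone-like subscription, analogous to RANDOM_SUBNETS_PER_VALIDATOR.
    return set(rng.sample(range(CHUNK_SUBNET_COUNT), K_LONG_TERM))


def pick_short_term_subnets(rng: random.Random, long_term: Set[int]) -> Set[int]:
    # Short-lived random joins, using local randomness only, to make targeted
    # withholding against a known subscription set harder.
    remaining = [i for i in range(CHUNK_SUBNET_COUNT) if i not in long_term]
    return set(rng.sample(remaining, Q_SHORT_TERM))


def metadata_leak(long_term: Set[int]) -> List[int]:
    # Only a small subset of subscriptions is advertised: enough to help bootstrap
    # new subscribers without revealing the full sampling set.
    return sorted(long_term)[:LEAK_COUNT]
```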
Parameter similarity:

- `RANDOM_SUBNETS_PER_VALIDATOR` - `k`
- `EPOCHS_PER_RANDOM_SUBNET_SUBSCRIPTION` - the slow rotation (incl. random rotation lengths)
- `ATTESTATION_SUBNET_COUNT` - the number of chunk subnets
- validator shuffling - chunk shuffling

### To be decided

The interesting bits to decide:

- Shuffling add-on; needs randomness to put chunks on subnets. A hash of the shard block may be good enough. Need better general sampling still, if the subnet count is low.
- Parameters:
  - Number of chunk subnets (could be more than with attestations, if it's random work anyway)
  - Data availability: `N` chunks, `k` samples, exact details about chunk size (do we think of them as ranges, or as the 32-byte pieces?)
  - Rotation: epochs for slow rotation of the `k` subnets, and slots for some `q` random subnets getting rotated more quickly (both applied with random variance).
- Initial discovery of `k` random subnets.
- A small constant `q` for the number of subnet indices to intentionally leak, to bootstrap others who are joining.
- Approach to publishing messages
- Formatting / contents of the proof of chunks, to validate gossip messages (and avoid DoS)

And the more boring details to decide:

- Encoding details of messages
- Topic naming

And then testing the idea, probably starting off with chunkifying the input and publishing the chunks to many subnets; see the publisher sketch at the end of this write-up. The subscriber side should be relatively simple.

### Add-ons

Add-ons briefly discussed on the call but not described as much:

- Sentry nodes that can reconstruct the missing data, and publish the reconstructed data
- For the more powerful nodes, an option to listen in on the full shard data

Both could be really useful to fill gaps in whatever subnet is missing chunks. Error correction has a direct use-case here, and topics are a clear go-to for nodes that have the full data and can publish it as necessary.
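As referenced above, a minimal publisher-side sketch for testing, assuming a generic gossipsub `publish(topic, data)` callable; the chunk size, topic names and the plain mapping are illustrative placeholders (the erasure-coding extension and the per-round shuffle are omitted here):

```python
# Hypothetical publisher-side sketch: split shard block data into fixed-size chunks and
# publish each chunk on its subnet topic. `publish` stands in for a gossipsub publish call.
from typing import Callable, List

CHUNK_SIZE = 512          # assumed chunk size in bytes
CHUNK_SUBNET_COUNT = 64   # assumed number of chunk subnets


def chunkify(data: bytes) -> List[bytes]:
    # Pad to a whole number of chunks and slice; the erasure-coding extension is omitted.
    padded = data + b'\x00' * (-len(data) % CHUNK_SIZE)
    return [padded[i:i + CHUNK_SIZE] for i in range(0, len(padded), CHUNK_SIZE)]


def publish_chunks(data: bytes, publish: Callable[[str, bytes], None]) -> None:
    chunks = chunkify(data)
    chunks_per_subnet = max(1, len(chunks) // CHUNK_SUBNET_COUNT)
    for index, chunk in enumerate(chunks):
        # The shuffled chunk_index <> subnet mapping from the earlier sketch would slot in here;
        # for brevity only the plain even split over subnets is shown.
        subnet = (index // chunks_per_subnet) % CHUNK_SUBNET_COUNT
        publish(f"/eth2/chunk_subnet_{subnet}", chunk)  # topic naming is TBD
```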