# DAS requirements

[toc]

## Overview

Data availability sampling ensures that a block of data is available for download to a number of observers, without any of them downloading the full data. It works by:

1. Encoding the data block using an erasure code, e.g. a Reed-Solomon code
2. Distributing the data samples in a peer-to-peer data structure, e.g. a distributed hash table
3. Having nodes query a number of random samples (determined by the statistical security parameter) to check that the data is available

For example, assuming that the data is Reed-Solomon encoded with rate r=0.5, setting the security parameter to 1e-9 (i.e. one in one billion unavailable data blocks can pass an individual node's check) requires 30 samples to be downloaded in step 3.

The guarantees provided by this construction are:

1. An attacker that does not control the peer-to-peer data structure can only trick a vanishingly small number of nodes (determined by the security parameter) into accepting the data
2. An attacker that does control the peer-to-peer data structure/can isolate nodes is able to trick only a small, constant number of nodes into accepting that the data is available, before giving away so much of the data that the full data can be reconstructed (the peer-to-peer construction should ensure that reconstruction will happen in such a case)
3. Except for the small number of nodes potentially tricked under point 2, eventually all nodes should agree on the availability of the data; if some samples are missing, they should be reconstructed so that nodes that depend on them will see them as available
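As a sanity check on the numbers above, here is a minimal sketch of the sample-count calculation (`required_samples` is an illustrative helper, not part of any specification). It treats the queries as independent draws; real sampling without replacement only tightens the bound.

```python
import math

def required_samples(rate: float, security: float) -> int:
    # For a rate-r erasure code, an unavailable block has less than a
    # fraction r of its samples obtainable (any more would allow full
    # reconstruction). Each uniformly random sample therefore succeeds
    # with probability below r, and n samples all succeed with
    # probability below r**n. We need r**n <= security.
    return math.ceil(math.log(security) / math.log(rate))

# Rate 0.5 and security parameter 1e-9 give the 30 samples quoted above,
# since 2**-30 is roughly 9.3e-10.
assert required_samples(0.5, 1e-9) == 30
```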
## Example construction

The default construction that has been put forward for this in the past is to use a DHT like Kademlia to store the data samples. While Kademlia has some excellent properties, there are serious concerns about its robustness under attack. Of particular concern are Sybil attacks: nodes can spam the table, potentially populating parts of it very densely and making them unusable. We are looking for a construction with better resilience in the face of adversaries.

Additionally, disseminating large amounts of data *into* a DHT might require new routing algorithms and other considerations to do so in an effective but relatively low-bandwidth manner.

## Parameters

For details see [here](https://notes.ethereum.org/@dankrad/danksharding_encoding). Please take the parameters below as a rough guideline, as detailed specifications are subject to change.

* The payload data (pre-encoding) is approximately 32 MB
* The extended (encoded) data is 128 MB
* It is arranged in a square (can be a rectangle) of size 512x512 samples, with the following properties:
    * Each sample is authenticated (using the KZG polynomial commitment scheme)
    * The correctness of the encoding is ensured by the KZG proofs
    * The original data is in the top left 256x256 square of the samples
    * Each sample is 512 bytes, plus one KZG proof of 48 bytes for authentication
    * Each row and each column form a Reed-Solomon code with rate 0.5, so each row and column can be individually reconstructed

## Requirements

1. Disseminate rows and columns to validators
    * Validators download a number of rows and columns so that they can take part in reconstruction
    * Note that validator node IDs are not publicly known, so this should be done through a means that does not expose validators more than they already are
2. Disseminate data into the peer-to-peer data structure that can be used to query samples
3. Support queries for random samples in an efficient and safe way. This can potentially be broken down into two versions of the problem:
    * Live sampling to follow the head
    * Historic sampling (to some slot depth) to do DA checks from some historic block to the head. This is also critical in the event that "live" sampling fails and you fall behind. NOTE: historic sampling can fulfill the "live" requirement if it is fast and safe enough
4. Identify and reconstruct missing data (to then disseminate via 1 & 2); a sketch of such a reconstruction loop appears at the end of this note

## Networking building blocks

The following are the current peer-to-peer building blocks being examined to implement data availability sampling:

* Peered topic-based gossip (libp2p gossipsub)
* Peered request/response (libp2p)
* DHT UDP queries (discv5)
* DHT structured information passing (discv5)
* Tying validators to nodes and directly incentivizing aspects
* Leveraging a new consensus role or a centralized DA provider

See [here](https://notes.ethereum.org/@djrtwo/SJBMbhGw5) for an informal discussion of the networking building blocks under consideration for DAS in Ethereum.
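To make the reconstruction requirement (item 4 above) concrete, the following is a minimal sketch of a greedy row/column decoding loop over the 512x512 sample grid from the parameters section. The grid representation and the `rs_recover` placeholder are assumptions for illustration; a real implementation would plug in an actual Reed-Solomon decoder and the KZG proof checks.

```python
from typing import List, Optional

SIZE = 512        # samples per row/column in the extended square
HALF = SIZE // 2  # minimum known samples needed to decode a line (rate 0.5)

Sample = bytes    # 512-byte sample payload
Grid = List[List[Optional[Sample]]]

def rs_recover(line: List[Optional[Sample]]) -> List[Sample]:
    """Placeholder for a Reed-Solomon decoder over one row or column.

    With rate 0.5, any >= 256 of the 512 positions suffice to recover
    the full line (MDS property of Reed-Solomon codes).
    """
    raise NotImplementedError

def reconstruct(grid: Grid) -> bool:
    """Greedily decode rows and columns until no further progress.

    Returns True if the grid is fully reconstructed.
    """
    progress = True
    while progress:
        progress = False
        # Rows first, then columns; each decoded line may push other
        # lines over the 50% threshold, so iterate to a fixed point.
        for i in range(SIZE):
            row = grid[i]
            if any(s is None for s in row) and sum(s is not None for s in row) >= HALF:
                grid[i] = list(rs_recover(row))
                progress = True
        for j in range(SIZE):
            col = [grid[i][j] for i in range(SIZE)]
            if any(s is None for s in col) and sum(s is not None for s in col) >= HALF:
                for i, s in enumerate(rs_recover(col)):
                    grid[i][j] = s
                progress = True
    return all(s is not None for row in grid for s in row)
```

Once reconstruction succeeds, the recovered samples would be disseminated again via requirements 1 & 2 so that other nodes eventually see the data as available.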