# Alexandria: Chain History over Discovery V5.1 DHT
Welcome to the Alexandria specification!
> Note that this is a work in progress and is subject to incompatible changes without notice
Alexandria is an overlay network to the Discovery V5.1 network which facilitates distributed storage and retrieval of historical chain data for the Ethereum blockchain.
The design of the system loosely follows the *"standard"* Kademlia DHT structure with each piece of content being associated with a *key* and each *key* being mapped to a specific *node-id* in the network. Nodes on the network perform the following functions.
* storing content and serving requests for that content.
* maintaining an index of where content can be found on the network.
### Overlay Network
The overlay network is a Kademlia routing table as documented in the [Discovery Wire Protocol](https://github.com/ethereum/devp2p/blob/4ba94c8a2ab5781d754d56ba840abfe38f502342/discv5/discv5-theory.md). The overlay network routing table is maintained separate from the core Discovery V5 routing table and comprised only of nodes that support the Alexandria protocol.
> TODO: How does one populate this routing table. ENR records, or do we implement a separate `FIND_NODES/FOUND_NODES` set of messages within the sub-protocol.
### Content and Keys
Each piece of content `C` is associated with a key `K`. Each key is a sequence of bytes. Each key is mapped to a 32 byte *contentId* via the cryptographic hash function `hash(K)`.
The *contentID* and *nodeID* share the same keyspace and are effectively interchangeable, however this document will attempt to use *contentID* when referring to content, and *nodeID* when referring to network nodes.
> TODO: `hash(...)` needs to be defined. Probably `sha256` for simplicity.
### Distance Function
> TODO: same as routing table distance function
### Finding Content
Each node in the network is uniquely identified by their *nodeID*. Each node on the network is responsible for maintaining an index of which nodes on the network are known to have the data for content that maps to a *contentID* that is in *close* proximity to their own.
To find a piece of content `C` with *contentID* `CID` they would perform an iterative lookup to find the nodes in the network that are closest to `CID`. These nodes can then be sent a `LOCATE` request with the desired *contentID*. The queried nodes will inspect their index and return then *nodeID* of any nodes who have advertised that they have the desired content.
Note that content can be stored *anywhere* on the network, and it is only the knowledge of where content is located that is tied to the content's *contentID*.
### Advertisements by the numbers
The current Ethereum mainnet has approximately 10,000 nodes. We can establish this as a rough estimate of the lower bounds of projected network participation. As an arbitrary upper bound we'll assume potential for 100,000 nodes.
Depending on how you look at the data that needs to be stored, there are a bit over 2 billion items that need to be indexed in the DHT. 1 billion transactions, 1 billion receipts, and 10 million headers and blocks.
This means that each node will *on average* need to make available between 20,000 - 200,000. We'll use the number 100,000 for simplicity. Each of these items is individually addressable, meaning that a node coming online will need to to disseminate 100,000 advertisements around the network.
For this to be efficient, we'll need it to be able to happen relatively quickly, so lets put an upper bound of 10 minutes on the "bootstrapping" process where a node comes online with an existing "store" of data. That means that it has 600 seconds to push 100,000 advertisements, requiring a rate of 166 advertisements per second. Each advertisement will contain a key which is likely to be between 64-128 bytes long. In terms of total data this means only about 6-100mb of data depending on how big the advertisement payload is.
These numbers seem attainable, but there are likely efficiency gains that can be made.
### Replication factor
Using the gossip numbers from above, let us examine what the expected replication factor would be for the network. As of September 2020, the storage of an Ethereum node looks roughly as follows.
| DATABASE | CATEGORY | SIZE |
| Key-Value store | Headers | 50.49 MiB |
| Key-Value store | Bodies | 3.41 GiB |
| Key-Value store | Receipts | 4.08 GiB |
| Key-Value store | Difficulties | 6.17 MiB |
| Key-Value store | Block number->hash | 5.06 MiB |
| Key-Value store | Block hash->number | 421.48 MiB |
| Key-Value store | Transaction index | 27.54 GiB |
| Key-Value store | Bloombit index | 1.64 GiB |
| Key-Value store | Trie nodes | 80.33 GiB |
| Key-Value store | Trie preimages | 1.27 GiB |
| Key-Value store | Account snapshot | 0.00 B |
| Key-Value store | Storage snapshot | 0.00 B |
| Key-Value store | Clique snapshots | 0.00 B |
| Key-Value store | Singleton metadata | 151.00 B |
| Ancient store | Headers | 4.52 GiB |
| Ancient store | Bodies | 98.20 GiB |
| Ancient store | Receipts | 44.57 GiB |
| Ancient store | Difficulties | 166.26 MiB |
| Ancient store | Block number->hash | 387.27 MiB |
| Light client | CHT trie nodes | 1.21 GiB |
| Light client | Bloom trie nodes | 1.99 GiB |
| TOTAL | 269.77 GIB |
Ignoring the nuances of this data we can place a rough upper bound of 500GB of unique data that needs to be store for the near-term future.
Assuming the network has between 10,000 - 100,000 nodes, and each node on average contributes 1GB of storage to the network, we can estimate the network's total capacity to be between 10TB-100TB. From this we can derive a replication factor somewhere between 20-200
This means that a naive advertisement scheme in which each advertisement must be sent directy to each node that stores it would require 20x-200x more messages.
#### Gossip Theory
Each advertisement relates to a specific `content_id` which inturn, maps to exactly one node which is *closest*, and thus ultimately responsible. One desirable approach would be for nodes that receive an ad to relay it onwards towards nodes that are closer to the `content_id`. Naively implemented, this scheme poses an amplification vector since a single message with a bogus ad would trigger the receiving node to relay it to whatever nodes it knows that are closer to the the `content_id`. For this reason, we would need a mechanism to combat this type of malicious amplification.
Ideally, we would want a system that has the following properties.
- advertisements are only sent to nodes that are willing/interested in storing them.
- only *valid* advertisements are sent
- not forged
- ideally validate that the have the necessary data
A node on the network would allocate some amount of storage `S` towards ads. Each add corresponds to a specific `content_id`. We can define the *radius* of their stored ads as the maximum distance for any ad from their local `node_id`. As a node encounters more content and their storage becomes full, they will begin evicting the ads that are furthest from their `node_id` as they encounter new ads which are closer and thus more interesting to them.
If nodes published their current ad radius, then other nodes can know whether an ad is interesting. My maintaining up-to-date information about the ad radius for the nodes in your routing table, we can then ensure that ads are only sent to nodes that would be *interested* in them. Current the `ADVERTISE` message doesn't have a defined response. Under this scheme, the response could contain the ad radius so that the sender then knows whether subsequent advertisements should be sent.
For an advertisement to be valid we want to ensure that the node *serving* then content actually has the data and that the ad was not forged. We can used digital signatures to prevent forgeries at the expense of roughly 65 bytes per advertisement.
As for ensuring someone has the data they claim to have... **if** we had a merkle root of the data, then it would be relatively easy to do some sort of proof of custody by requesting a random chunk of the data. The problem here seems to be that we typically won't have a merkle root, but rather a boring `keccak(data)` which doesn't allow us to verify the data until we have it in its entirety. See below in the "Transmitting Data" section on using SSZ trie hashing which would facilitate data audits.
### Gossip style PUSH Advertisements
In order to avoid requiring nodes to send each advertisement to 20-200x nodes due to replication, we want a scheme where each advertisement need only be sent once, and it will be automatically replicated in the appropriate region of the network.
Let us explore a potential algorithm for doing this.
Given a piece of content with content ID `C` we expect between 20-200x nodes near `C` to be storage locations. We'll suppose that each node maintains a Kademlia routing table and that for each node in their routing table, they also maintain the node's advertised storage radius `R`.
Our gossip system should aim for two properties.
- A: The node in the network that is **closest** to `C` will find out about the content.
- B: The majority of nodes in close proximity to `C` will find out about the content.
To accomplish objective A, a node wishing to advertise `C` will perform an iterative lookup for the nodes closest to `C` and send the advertisement to the `n` closest lively nodes (probably `n=3` for some redundancy).
Upon receiving an advertisement, a node will do the following.
1. *if* `C` is not of interest to them, they will ignore it.
2. otherwise, they will verify the advertisement, store it, and forward it onto the node in their routing table that is closest to `C`.
The above ensures we satisfy objective A.
To satisfy objective B we need the advertisement to be forwarded to the majority of nodes `N` each with advertisement radius `R` such that `distance(N, C) < R`. In order to do this efficiently we would probably want to spread the load across as many nodes as possible. This is typically done by doing something like computing the set `[n1, n2, ..., n]` of nodes in the routing table that satisfy the broadcast condition and then randomly selecting a logarithmic or square-root number of those peers and only broadcasting to them.
This mechanism ensurse that objective B is met, propogating each advertisement to the broader area of the network where it's content Id resides.
### What each node tracks
#### Advertisement Raidus: `R`
Each node should track `R` which is the largest distance from the nodes `node_id` to the `content_id` for any of its stored advertisements.
This is a value that needs to be semi-regularly updated. We could potentially transmit this in something like a `Pong` message.
#### Verified Merkle Roots
Assuming advertisements include merkle roots, nodes should retain at minimum an ephemeral mapping of `Key -> root` for all merkle roots that have been fully validated.
#### Which Advertisements have been sent
To avoid re-broadcasting an advertisement to the same peer multiple times, a recent cache of which advertisements have been sent to which nodes should be kept. A simple LRU cache would probably be fine.
### Transmitting Data
We need to make some decisions on how content payloads will be transferred in the DHT.
We can expect a rough upper limit of 1kb transfer per datagram. A brief survey of recent blocks suggests block sizes around 30kb meaning that it will take 30x messages to transmit a single block body.
A naive approach would be to request the data with one message, and then wait for all of the response messages to come in, after which you can re-assemble content. Presumably clients would also then validate that the content is well formed such as reconstructing the transaction and uncle tries and checking them against he header.
One of the simplest ways that this approach fails is if a single message is lost. Without the ability to request specific "chunks" of the data, the full data would need to be re-requested. This also means that you must get *all* of the data from a single source even though there may be multiple sources available that you could pull data from concurrently.
* property a: ability to request specific pieces of the data
* property b: ability to retrieve data concurrently from multiple sources.
Another problem with this approach is that the recipient cannot validate the data until they have received the full payload, meaning they may be holding onto many bad chunks before they are able to determine they are junk.
* property c: ability to validate data on a per-message granularity.
Putting these together, we would want a scheme with roughly the following shape.
1. Content would be identified by something like a merkle root.
2. Content requests can be targeted at specific chunks from the merkle tree.
With these two things, we should be able to have all of the desired properties. The problem is that most of our content isn't identified by a merkle root. Maybe we can address this in our advertisement system.
Each advertisement will contain a `key` for some specific piece of content. The key carries semantic meaning which allows each key to be tied to a validation scheme. For example, the header key would contain the canonical hash of that header. Verification of the data would involve checking that `keccak(rlp(header)) == expected_hash`.
Other data however, such as single transactions or a block of transactions require more specialized validation, either re-constructing the `transactions_root` and verifying that it matches the corresponding field in the header, or potentially, for single transactions, verifying that an accompanying proof is valid against the `Header.transactions_root`.
We could add a merkle root to each advertisement. Verification of this root would be possible by fetching the underlying data. Since the hash is a merkle root, it would also allow fetching individual pieces of the trie, each of which would be verifiable individually against the merkle root. Upon recieving two advertisements for the same `content_id` but with different merkle roots, it would be straight forward to figure out which one (if any) was the correct one, and purge all of the remaining ones.
Under this scheme the content retrieval algorithm would be roughly this.
- Gather up advertisements
- Group advertisements by merkle root
- Starting with the most popular root, fetch the data. If the complete data payload does not pass validation, move onto the next most popular root and repeat.
- This process will eventually either succeed at the final validation phase, or will discredit all of the advertisements.
From the perspective of someone storing advertisements, once a merkle root has been verified it would not need to be re-verified. A standard trigger for verification could simply be anytime that two advertisements are found for the same `content_id` but with mismatched merkle roots. Even in an adversarial situation, a client would eventually settle into an equilibrium where all of their advertisements have been validated (and thus they will be able to reject any advertisements for that `content_id` which don't match the already known merkle root)
This all builds to an Advertisement model that looks roughly like:
advertisement := key || valid_until || merkle_root || signature
valid_until := timestamp when this advertisement should be discarded.
valid_until := tim
key := full key for the content
merkle_root := something like an SSZ merkle hash
The signature will be 65 bytes. The Merkle root will be 32 bytes. The key will tend to be less than 100 bytes.
More research TBD on the exact mechanism for retrieval.
### Content Keys and Semantic Meaning
While the keyspace is unbounded and allows for *any* arbitrary byte strings as keys, the protocol dictates how the keyspace is used, and thus restricts the allowed keys as well as embedding semantic meaning into each key.
All keys begin with a common prefix `<prefix>`: `eth/v1`
> TODO: differentiation between different chains/networks/etc in the prefix. `network-id`, `fork-id`, `chain-id`?
The following keyspaces are defined for storing the following different pieces of content.
`<prefix>/header/<hash>` > `rlp(header)`
This path simply returns the RLP encoded header. Validation involves checking that the returned data hashes to the `<hash>` from the key.
`<prefix>/header-accumulator/<epoch-number>` > EPOCH_ACCUMULATOR
WORK-IN-PROGRESS: Use Eth2 header accumulator pattern. Can probably use fixed size SSZ list here and use built-in SSZ merklization and hashing. We should develop this in conjunction with an EIP to modify the block header to include this hash. These would be used for inclusion proofs to allow clients to discard the header chain after validating it.
#### Block Uncles
`<prefix>/block/<hash>/uncles` > `rlp([uncle_0, uncle_1, ..., uncle_n])`
`<prefix>/block/<hash>/uncles/<idx>` > `rlp(uncle)` (proof?)
`<prefix>/block/<hash>/transactions` > `rlp([txn_0, txn_1, ..., txn_n])`
`<prefix>/block/<hash>/transactions/<idx>` > `rlp(txn)` (proof?)
`<prefix>/block/<hash>/receipts` > `rlp([receipt_0, receipt_1, ..., receipt_n])`
`<prefix>/block/<hash>/receipts/<idx>` -> `rlp(receipt)` (proof?)
#### Canonical Indices
`<prefix>/canonical/header/<number>` -> `<hash>`
`<prefix>/canonical/transaction/<hash>` -> `<hash>/<idx>`
All messages are transported over the `TALKREQ/TALKRESP` messages provided by the base Discovery v5.1 wire protocol. Messages are encoded as:
payload = request-id || message-id || message
request-id = 32 bit big endian
message-id = 8 bit big endian
message = encoded message bytes
All messages use the protocol identifier `TODO`
### FIND_NODES | message-id == 1
TODO: Same as base protocol but in SSZ
### FOUND_NODES | message-id == 2
TODO: Same as base protocol but in SSZ
### ADVERTISE | message-id == 3
TODO: List of keys that node is advertising they provide
> TODO: Should this be signed to prevent counterfits? Since node-ids are cheap, does counterfit prevention get us anywhere?
> TODO: Should this have a response message???
### LOCATE | message-id == 4
TODO: Query the nodes advertisements for any nodes they know about which serve a given key (or keys). Expects a FOUND_NODES in response.
### RETRIEVE | message-id == 5
TODO: request a piece of content by its key
### DATA | message-id == 6
TODO: multi-part response to a RETRIEVE
> TODO: Griefing vectors, merklization and chunking of content
## Network Data Requirements
TODO: Work out the following.
- Total size of all data stored in the network broken down by each type of data.
- Rate of data growth bith in GB and number of items
- Total number of entries for each piece of data to be stored.
- Table/Chart showing number of nodes and individual node storage requirements across the following three axis:
- Number of network nodes
- Replication factor
- Basic Performance Projections
- Number of advertisements a standard node will have to make when coming online.