# Alexandria
Welcome to the Alexandria specification!
> Note that this is a work in progress and is subject to incompatible changes without notice
Alexandria is an overlay network on top of the Discovery v5.1 network which facilitates distributed storage and retrieval of historical chain data for the Ethereum blockchain.
The design of the system loosely follows the “standard” Kademlia DHT structure with each piece of content being associated with a key and each key being mapped to a specific location in the network.
## Table of Contents
TODO
## Concepts
### Overlay Network
The overlay network is built on top of the [`TALKREQ`](TODO) and [`TALKRESP`](TODO) message pair exposed by the discovery v5.1 protocol. Nodes participating in this network maintain a separate routing table comprised only of nodes which are known to support the Alexandria protocol.
The Alexandria protocol uses its own version of the message pairs `PING/PONG` and `FIND_NODES/NODES` to manage the routing table and traverse the network.
### Nodes
The term *Node* is used to refer to a participant in the network. Each node is uniquely identified by its 32 byte `node_id` from the core discovery network.
### Content and Keys
The term *Content* is used to refer to a piece of data such as a block header.
Each piece of content is uniquely identified by both a *Key*, which is a string of bytes, and a 32 byte `content_id` which is derived from the key via `hash(key)`.
> TODO: `hash(..)` is meant to serve as a placeholder for whatever hash function seems appropriate. Likely `sha256` or `keccak` for simplicity.
### Distance Function
The distance function is used to determine both the distance between two nodes and the distance between a node and a piece of content.
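The specification does not pin down the metric at this point; a minimal sketch, assuming the standard Kademlia XOR metric over 256-bit identifiers:

```python
def distance(a: int, b: int) -> int:
    """XOR distance between two 256-bit identifiers (node_id or content_id)."""
    return a ^ b


def log_distance(a: int, b: int) -> int:
    """Logarithmic distance: bit length of the XOR distance; 0 when a == b."""
    return (a ^ b).bit_length()
```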
### Advertisements
An Advertisement is how nodes in the network make other nodes aware of what content they are storing. Each advertisement has the following parts.
- Key: The key of the content this advertisement is for
- Signature: A digital signature to prevent forgeries
- NodeID: The `node_id` of the node publishing the advertisement.
- Expiration: The timestamp when this advertisement should no longer be considered valid.
- Merkle Root: The merkle root of the content data
Implicitly, each advertisement has an associated `content_id` derived from the advertisement's key.
Each node in the network participates in the advertisement system by maintaining a table of advertisements whose `content_id` is *close* to their own `node_id`.
We refer to the *Radius* of the advertisement table as the largest distance between the `node_id` and `content_id` across the set of content ids present in the table. The radius is strictly bounded to the range `[0, 2**256 - 1]`.
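The radius can be computed directly from the table contents. A minimal sketch, assuming the XOR distance metric and that an empty table implies the maximum radius:

```python
def advertisement_radius(node_id: int, content_ids) -> int:
    """Largest XOR distance between node_id and any stored content_id.
    An empty table is assumed here to imply the maximum possible radius."""
    if not content_ids:
        return 2**256 - 1
    return max(node_id ^ cid for cid in content_ids)
```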
## Node Responsibilities
### Maintaining the Routing Table
Nodes should ensure the liveliness of the nodes in their routing table. Nodes should track the timestamp of the last successful `PING/PONG` exchange with each node in their routing table and periodically ping the node that was least recently pinged.
Nodes that fail the liveliness check should be evicted from the routing table. As in the base protocol, it is recommended that nodes keep a replacement cache to ensure that their routing table buckets stay healthy.
### Tracking Advertisement Radius
The `PING`, `PONG`, and `ACK` messages all contain the `advertisement_radius`. It is recommended that nodes maintain something like an LRU cache of the most recently seen `advertisement_radius` value for each node.
See "Advertisement Gossip" for more information on how this value is used.
### Maintaining the Advertisement Table
Nodes are expected to keep track of advertisements that are *close* to their `node_id` in their Advertisement Table. Implementations are recommended to cap the table at a maximum number of advertisements and to perform eviction using the following rules when the table is over capacity.
An advertisement should be evicted if:
- It is expired
- The Node publishing the advertisement is not reachable.
- There are too many other advertisements for the same `content_id`.
- Its `content_id` is more distant from the node's `node_id` than that of any other advertisement.
Periodic liveliness checks should be done, evicting the advertisements published by any node that is found to be unreachable.
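The eviction rules above can be sketched as follows. The capacity constants, the XOR distance metric, and the `is_reachable` callback are illustrative assumptions, not part of the specification:

```python
import time
from collections import Counter

MAX_ADS = 1024            # assumed table capacity
MAX_PER_CONTENT_ID = 8    # assumed per-content_id limit


def evict(table, node_id, is_reachable):
    """table: list of ads with content_id, node_id and expires_at fields."""
    now = time.time()
    # Rule 1: drop expired advertisements
    table = [ad for ad in table if ad.expires_at > now]
    # Rule 2: drop advertisements published by unreachable nodes
    table = [ad for ad in table if is_reachable(ad.node_id)]
    # Rule 3: cap the number of advertisements per content_id
    counts = Counter()
    kept = []
    for ad in table:
        counts[ad.content_id] += 1
        if counts[ad.content_id] <= MAX_PER_CONTENT_ID:
            kept.append(ad)
    table = kept
    # Rule 4: while over capacity, drop the advertisement whose content_id
    # is most distant (XOR metric assumed) from our node_id
    while len(table) > MAX_ADS:
        table.remove(max(table, key=lambda ad: ad.content_id ^ node_id))
    return table
```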
### Advertisement Gossip
To publish an advertisement to the network, one should first perform a lookup for nodes that are close to the `content_id`. The found nodes should be filtered by their published advertisement table *radius* to remove any nodes for which the content would fall outside of their published radius.
The advertisement should then be sent via the `Advertise` message to the closest node as well as a randomly selected logarithmic subset of the remaining found nodes.
Upon receiving an advertisement nodes should perform the following validation.
If the advertisement is already present in the local advertisement table, the advertisement should be discarded.
Next, the liveliness of the advertisement's Node should be verified. If the node is unreachable, the advertisement should be discarded.
Next, the advertised merkle root should be checked against any advertisements for the same key that are already present in the advertisement table. If the merkle roots differ, then full validation should be performed to determine the correct root. Nodes should maintain a cache of known `content_id -> merkle_root` mappings so that future discrepancies for a previously validated `content_id` do not have to be re-validated.
Otherwise, if there are no discrepancies between the advertisement's merkle root and ones already present in the advertisement table, a proof of custody check SHOULD be performed (TODO: depends on merklization).
At this stage, the advertisement should be inserted into the local advertisement table. Then, using either an iterative *lookup* or simply querying the local routing table, the advertisement should be forwarded to a logarithmic subset of the nodes for which the `content_id` falls within their advertisement radius.
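The publication flow described at the start of this section can be sketched as below. The `lookup`, `radius_of`, and `send_advertise` callables are hypothetical stand-ins for an iterative Kademlia lookup, the radius cache, and the `Advertise` message send, and the XOR metric is assumed:

```python
import math
import random


def publish(advertisement, content_id, lookup, radius_of, send_advertise):
    # Find nodes close to the content_id (iterative lookup, assumed)
    candidates = lookup(content_id)
    # Keep only nodes whose published radius covers this content_id
    interested = [
        n for n in candidates
        if content_id ^ n.node_id <= radius_of(n)
    ]
    if not interested:
        return
    closest, *rest = sorted(interested, key=lambda n: content_id ^ n.node_id)
    # The closest node plus a random logarithmic subset of the remainder
    sample_size = min(len(rest), max(1, int(math.log2(len(rest) + 1))))
    targets = [closest] + random.sample(rest, sample_size)
    for node in targets:
        send_advertise(node, advertisement)
```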
### Tracking Canonical Merkle Roots
Advertisements and Content retrieval both operate on pairs of `(content_key, hash_tree_root)`. Since the `hash_tree_root` for each piece of content is not part of the core Ethereum protocol, we cannot know this value in advance. This means adversarial nodes can distribute advertisements with `hash_tree_root` values that do not represent the actual content.
However, since all content on the network can be validated against the header chain, nodes can resolve this situation by fetching the content for the various `hash_tree_root` values they observe to determine which of them is valid.
Nodes are encouraged to keep a cache of validated `(content_id, hash_tree_root)` pairs to minimize how often they need to re-validate content.
### Collecting, Storing, and Serving Content
Implementations are encouraged to collect, store and serve content for the network.
A simple approach to content collection is to monitor incoming advertisements for content with a `content_id` that is *near* the node's `node_id`. Once an advertisement has been validated, the node can fetch, store, and serve this content. Nodes can allocate a fixed amount of storage and evict the content that is furthest away when that storage is full.
Nodes can also monitor `Locate` requests to learn about new content.
## Navigating the Network
### Finding Nodes
TODO: copy/pasta from DiscoverV5 spec
### Distributing Advertisements
### Finding and Retrieving Content
To find content in the network we must first find nodes in the network that are *close* to the content's `content_id`. One or more of these nodes should then be queried with a `Locate` message to collect advertisements for the content.
Advertisements should be grouped by their merkle root. In non-adversarial situations there should only be a single merkle root, but there is nothing to prevent fraudulent merkle roots.
### Resolving merkle root disputes
Starting with the most numerous merkle root, nodes issue `Retrieve` requests until the full content has been re-assembled and validated against that root.
## Content Proofs
Content transmission in Alexandria is done using SSZ partial proofs.
Content is serialized using the sedes: `List[uint8, max_length=2**30]` which we refer to as `content_sedes`.
### Terminology
For demonstration purposes we will use the sedes: `List[uint8, max_length=512]` which gives us a manageably small tree.
For the "data" we'll use the following hex encoded byte string.
```
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f
404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f
606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f
808182838485868788898a8b8c8d8e8f
```
The `content_length` for this data is `144`.
The data gets chunked into five chunks.
> Note that the last chunk gets padded with zeros.
```
chunk-0: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
chunk-1: 202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f
chunk-2: 404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f
chunk-3: 606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f
chunk-4: 808182838485868788898a8b8c8d8e8f00000000000000000000000000000000
```
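The chunking shown above (32-byte chunks, with the final chunk zero-padded) can be reproduced with a short helper:

```python
def chunk_content(data: bytes, chunk_size: int = 32) -> list:
    """Split content into fixed-size chunks, zero-padding the final chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1].ljust(chunk_size, b'\x00')
    return chunks
```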
The merkle tree for this data would look as follows.
```
0: R
/ \
1: 0 L
/ \
/ \
/ \
/ \
/ \
/ \
/ \
2: 0 H
/ \ / \
/ \ / \
/ \ / \
3: 0 1 0 1
/ \ / \ / \ / \
/ \ / \ / \ / \
4: 0 1 0 G 0 1 0 1
/ \ / \ / \ / \ / \ / \ / \ / \
5: A B C D E F 0 1 0 1 0 1 0 1 0 1
|<-----DATA---->| |<------------PADDING-------------->|
+------------------+ +------------------+ +------------------+
| Data | | Padding | | Length |
+==================+ +==================+ +==================+
| element | path | | element | path | | element | path |
+---------+--------+ +---------+--------+ +---------+--------+
| A | 000000 | | F | 000101 | | L | 0 |
+---------+--------+ +---------+--------+ +---------+--------+
| B | 000001 | | G | 00011 |
+---------+--------+ +---------+--------+
| C | 000010 | | H | 01 |
+---------+--------+ +---------+--------+
| D | 000011 |
+---------+--------+
| E | 000100 |
+---------+--------+
```
Each node in the tree can be addressed by its "Path". Paths are sequences of bits, represented by `0` and `1`. Every node has a 32-byte value. In this tree, paths are limited to 6 bits. For the `content_sedes` used in the protocol, paths are limited to 26 bits.
The `content_length` is always found at the path `1` in the tree in its little endian encoded format, padded with zeros to 32 bytes.
The term "Chunk" to refer to the leaf nodes in the merkle tree under the `0` path. In the tree above, there is a maximum chunk count of `512 / 32 -> 16`. For the `content_sedes` used in the protocol, the maximum chunk count is `(1024 * 1024 * 1024) / 32 => 33554432`.
Chunks can be addressed by their "Chunk Index" with the leftmost chunk in the tree having index `0` and counting up for each leaf node.
We refer to the chunks that store the actual content data as "Data Chunks". In this tree, the data chunks are labeled as `(A, B, C, D, E)` which correspond to the chunk indices `(0, 1, 2, 3, 4)`.
All of the chunks that don't store content data are referred to as "Padding". The padding portion of the tree is typically condensed to the minimal set of nodes necessary to represent the subtrees. For example the 8 padding chunks on the right side of the tree (from index 8 through 15) can be collapsed to only the node labeled `H` at path `01`.
The "Root" node of the tree is located at the "null" or "empty" path. Its value is the `hash_tree_root` which is the merkle root used to reference the underlying content.
The term "Partial" is used to refer to a merkle proof for a subset of the actual content data.
### Proofs and Partials
The term "Proof" is used to refer to a collection of nodes from the merkle tree. A node has a `path` which defines its location in the tree and a 32-byte `value` which is always one of:
- A data chunk
- A padding chunk
- An intermediate node (the hash of the node's two children)
- The root node (just a special intermediate node)
- The length node found at the path `1`
```
Node = (path, value)
Proof = (Node_1, Node_2, ..., Node_N)
```
> The `Node` elements of a proof tend to be lexicographically sorted.
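In Python terms, the structures above might be modeled as follows. This is an illustrative sketch, not a normative encoding:

```python
from typing import NamedTuple, Tuple

Path = Tuple[bool, ...]  # bit path, left to right from the root


class Node(NamedTuple):
    path: Path
    value: bytes  # always 32 bytes


Proof = Tuple[Node, ...]
```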
A path `P1` "prefixes" another path `P2` if `P2` starts with `P1`
```python
def is_prefix(p1: Path, p2: Path) -> bool:
    return len(p1) <= len(p2) and all(p2[idx] == el for idx, el in enumerate(p1))
```
#### Proof Validity
A proof is considered "well-formed" if for any path `P`, there is a node `N` in the proof such that `N.path` prefixes `P`.
A proof is considered "minimal" if there are no nodes which are the prefix of another node.
A proof is considered valid if it is "well-formed", "minimal".
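The "minimal" property can be checked directly with the prefix helper. A sketch (the quadratic scan is for clarity; sorted paths would allow a linear check):

```python
def is_prefix(p1, p2) -> bool:
    # Guard on length so p1 longer than p2 is never a prefix
    return len(p1) <= len(p2) and all(p2[i] == b for i, b in enumerate(p1))


def is_minimal(paths) -> bool:
    """True when no path in the proof prefixes another (duplicates included)."""
    return not any(
        i != j and is_prefix(p, q)
        for i, p in enumerate(paths)
        for j, q in enumerate(paths)
    )
```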
### Serialization and Deserialization
#### Path Serialization
Path serialization takes advantage of the property that most paths share a common prefix with other paths in the proof. When serializing a path we need access to both the path being serialized, `P`, and the previously serialized path, `Q`. When serializing the first path, we use the empty path as `Q`.
The serialization algorithm is as follows:
Let `common_bit_count` be the length of the common prefix between `P` and `Q`.
Let `P_tail` be the path bits from `P` that are not part of the common prefix.
Let `P_tail_length` be the number of bits in `P_tail`
Let `P_tail_int` be `P_tail` interpreted as an integer such that the leftmost bits are least significant and the right most bits are most significant.
```python
def path_to_int(p: Path) -> int:
    return sum(2**idx for (idx, bit) in enumerate(p) if bit)
```
Serialization involves bit packing `P_tail_length`, `P_tail_int` and `common_bit_count` together into a single integer as follows.
* The lowest 5 bits of the integer are the `P_tail_length`.
* The next `P_tail_length` bits are the `P_tail_int`.
* The remaining bits are the `common_bit_count`.
Using bit-shifting, this would be:
```python
shifted_P_tail_int = P_tail_int << 5
shifted_common_bit_count = common_bit_count << (5 + P_tail_length)
encoded_path_as_int = (
    P_tail_length | shifted_P_tail_int | shifted_common_bit_count
)
```
The `encoded_path_as_int` is then LEB128 encoded.
```python
encoded_path = leb128(encoded_path_as_int)
```
Deserialization follows the same algorithm in reverse. Note that we must still have access to the previous path `Q` for full reconstruction of the original path.
First we decode the LEB128 value to get the `encoded_path_as_int`.
We can then extract the `P_tail_length` using the bit mask `0b11111` (aka `0x1f`).
```python
P_tail_length = encoded_path_as_int & 0b11111
```
Next, we can extract the `P_tail` by constructing a bit mask of the length specified by `P_tail_length`.
```python
P_tail_int = (encoded_path_as_int >> 5) & (2 ** P_tail_length - 1)
P_tail = tuple(
    bool(2**idx & P_tail_int)
    for idx in range(P_tail_length)
)
```
Next we can determine the number of bits from `Q` that are shared.
```python
common_bit_count = encoded_path_as_int >> (5 + len(P_tail))
```
Finally, we can reconstruct the full path by combining the common prefix from `Q` with the `P_tail`.
```python
P = Q[:common_bit_count] + P_tail
```
```
For purposes of this specification we define the following two functions to represent this serialization and deserialization logic.
```python
def serialize_path(P: Path, Q: Path = ()) -> bytes:
...
def deserialize_path(encoded_path: bytes, Q: Path = ()) -> Path:
...
```
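Putting the pieces together, here is a runnable sketch of the path codec described above. It is not normative: `Path` is a tuple of bools, and the unsigned LEB128 helpers are included for self-containment.

```python
from typing import Tuple

Path = Tuple[bool, ...]


def encode_leb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)


def decode_leb128(data: bytes) -> int:
    value = 0
    for shift, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return value


def serialize_path(P: Path, Q: Path = ()) -> bytes:
    # Number of leading bits shared with the previously serialized path
    common_bit_count = 0
    for a, b in zip(P, Q):
        if a != b:
            break
        common_bit_count += 1
    P_tail = P[common_bit_count:]
    P_tail_int = sum(2**idx for idx, bit in enumerate(P_tail) if bit)
    encoded_path_as_int = (
        len(P_tail)                                # lowest 5 bits
        | (P_tail_int << 5)                        # next len(P_tail) bits
        | (common_bit_count << (5 + len(P_tail)))  # remaining bits
    )
    return encode_leb128(encoded_path_as_int)


def deserialize_path(encoded_path: bytes, Q: Path = ()) -> Path:
    encoded_path_as_int = decode_leb128(encoded_path)
    P_tail_length = encoded_path_as_int & 0b11111
    P_tail_int = (encoded_path_as_int >> 5) & (2**P_tail_length - 1)
    P_tail = tuple(
        bool(P_tail_int & (1 << idx)) for idx in range(P_tail_length)
    )
    common_bit_count = encoded_path_as_int >> (5 + P_tail_length)
    return Q[:common_bit_count] + P_tail
```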
#### Proof Serialization
Proof serialization takes advantage of the properties that:
- All "Padding" nodes from the tree can be reconstructed from the `content_length`
- The `content_length` is always found at the same location in the tree, under the path `1`.
These two properties allow us to compress serialized proofs by:
- Excluding the "Length" node and encoding the `content_length` separately.
- Excluding all "Padding" nodes.
This means that serialized proofs only contain nodes from the "Data" portion of the merkle tree. Since every such node's path starts with `0`, we can also omit the first bit of each path.
A serialized proof is comprised of the following parts.
1. The LEB128 encoded `content_length`
2. The LEB128 encoded number of nodes in the serialized proof.
3. The concatenated `value` elements for the nodes in the proof.
4. The serialized `path` elements for the nodes in the proof.
The `content_length` can be found in the proof node at path `1` which must be part of every proof.
We can compute the last data chunk index from the `content_length`:
```python
last_data_chunk_index = max(0, (content_length + 31) // 32 - 1)
```
Let `data_elements` be all of the nodes from the merkle tree between the chunk index `0` and `last_data_chunk_index`.
Proof serialization is then done as follows:
```python
def serialize_proof(content_length, data_elements):
    serialized_content_length = encode_leb128(content_length)
    serialized_node_count = encode_leb128(len(data_elements))
    paths = tuple(el.path for el in data_elements)
    values = tuple(el.value for el in data_elements)
    first_path = serialize_path(paths[0], ())
    tail_paths = b''.join(
        serialize_path(paths[i], paths[i - 1])
        for i in range(1, len(paths))
    )
    serialized_paths = first_path + tail_paths
    serialized_values = b''.join(values)
    return b''.join((
        serialized_content_length,
        serialized_node_count,
        serialized_values,
        serialized_paths,
    ))
```
Deserialization is done as:
1. Decode the LEB128 encoded `content_length`.
2. Decode the LEB128 encoded `node_count`.
- TODO: this choice results in serialized proofs having one extra byte. By interleaving the paths/values we can eliminate the need for the node-count.
3. Pull off the next `node_count * 32` bytes. These are the node values.
4. Decode the remainder of the payload as LEB128 encoded values and use `deserialize_path` to decode them.
The "length" and "padding" nodes can be reconstructed from encoded `content_length`.
TODO: outline how the padding elements are reconstructed or link to the DDHT implementation.
The reconstructed proof should then be validated as "minimal" and "well-formed".
### Proof Merging
When requesting content that doesn't fit into a single packet, clients will need to merge multiple partial proofs together until they have all of the data.
Implementations need to be able to figure out which chunks are still missing from a partial proof.
### Canonical Header Accumulator
All data in the Alexandria network is anchored to the header chain. At present, the header chain consists of ~11 million headers which add up to about 6GB of storage space.
The Alexandria network uses an accumulator to enable nodes to have proofs about historical header canonicalness without requiring them to store the full historical header chain.
The accumulator is made up of two SSZ data structures.
#### The Epoch Accumulator
The "Epoch" accumulator is defined as an SSZ List containing the block hash and computed total difficulty of a contiguous sequence of `EPOCH_SIZE` headers.
```
HeaderInfo = Container(
    block_hash: bytes32,
    total_difficulty: uint256,
)
EpochAccumulator = List[HeaderInfo, max_length=EPOCH_SIZE]
```
#### The Master Accumulator
The "Master" accumulator houses the `hash_tree_root` values from the "Epoch" accumulators for each epoch.
```
List[bytes32, max_length=2**20]
```
#### Filling the Accumulators
At each block, the block hash and computed total difficulty are added to the "Epoch" accumulator for the current epoch. This accumulator becomes "full" at each epoch boundary, at which point its `hash_tree_root` is added to the "Master" accumulator, and a fresh "Epoch" accumulator is created for the next epoch.
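The filling process can be sketched as follows. The `EPOCH_SIZE` value and the `hash_tree_root` function here are placeholders (the real function is the SSZ hash tree root of the epoch accumulator):

```python
import hashlib

EPOCH_SIZE = 2048  # illustrative placeholder; the spec's value may differ


def hash_tree_root(epoch):
    # Placeholder for the real SSZ hash_tree_root of the epoch accumulator
    h = hashlib.sha256()
    for block_hash, total_difficulty in epoch:
        h.update(block_hash + total_difficulty.to_bytes(32, 'little'))
    return h.digest()


def add_header(master, epoch, block_hash, total_difficulty,
               epoch_size=EPOCH_SIZE):
    """Append one header's (hash, computed total difficulty) pair to the
    epoch accumulator, rolling it into the master accumulator when full."""
    epoch.append((block_hash, total_difficulty))
    if len(epoch) == epoch_size:
        master.append(hash_tree_root(epoch))
        epoch = []  # fresh epoch accumulator for the next epoch
    return master, epoch
```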
#### Canonical Header Proofs
Each header should be accompanied by a proof that demonstrates it is part of the canonical chain of headers. This proof is comprised of two SSZ partial proofs, one against the "epoch" accumulator for the epoch the header belongs to, and another against a recent version of the "master" accumulator which proves the `hash_tree_root` of the epoch accumulator is part of the canonical history.
## Wire Protocol
All messages in the Alexandria protocol are transmitted using the `TALKREQ` and `TALKRESP` messages from the base protocol.
All Alexandria messages have a `message_id` and `encoded_message` that are concatenated to form the `payload` for either a `TALKREQ` or `TALKRESP` message.
```
payload := message_id | encoded_message
message_id := uint8
encoded_message := bytes
```
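Framing a payload is then a single-byte prefix concatenation; a minimal sketch:

```python
def encode_payload(message_id: int, encoded_message: bytes) -> bytes:
    """Concatenate the uint8 message_id with the SSZ-encoded message body."""
    assert 0 <= message_id < 256
    return bytes([message_id]) + encoded_message
```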
The `encoded_message` component is the SSZ encoded payload for the message type as indicated by the `message_id`. Each message has its own `sedes` which dictates how it should be encoded and decoded.
The SSZ sedes `byte_list` is used as an alias for `List[uint8, max_length=2048]`.
All messages have a `type` which is either `request` or `response`.
* `request` messages **MUST** be sent using a `TALKREQ`
* `response` messages **MUST** be sent using a `TALKRESP`
### Ping (0x01)
Request message to check if a node is reachable, communicate basic information about our node, and request basic information about the other node.
```
message_id := 1
type := request
sedes := Container(enr_seq: uint32, advertisement_radius: uint256)
```
* `enr_seq`: The current sequence number of the node's ENR record
* `advertisement_radius`: The node's current advertisement radius.
### Pong (0x02)
Response message to Ping(0x01)
```
message_id := 2
type := response
sedes := Container(enr_seq: uint32, advertisement_radius: uint256)
```
* `enr_seq`: The current sequence number of the node's ENR record
* `advertisement_radius`: The node's current advertisement radius.
### Find Nodes (0x03)
Request nodes from the peer's routing table at the given logarithmic distances. A distance of `0` indicates a request for the peer's own ENR record.
```
message_id := 3
type := request
sedes := Container(distances: List[uint16, max_length=256])
```
* `distances` is a list of distances for which the node is requesting ENR records.
* Each distance **MUST** be within the inclusive range `[0, 256]`
* Each distance in the list **MUST** be unique.
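The constraints above can be enforced with a small validator; an illustrative sketch, not part of the wire format:

```python
def validate_distances(distances) -> None:
    """Reject Find Nodes requests whose distance list violates the spec."""
    if len(distances) > 256:
        raise ValueError("distances exceeds the sedes max_length")
    if len(set(distances)) != len(distances):
        raise ValueError("distances must be unique")
    if any(not (0 <= d <= 256) for d in distances):
        raise ValueError("each distance must be in the range [0, 256]")
```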
### Nodes (0x04)
Response message to FindNodes(0x03).
```
message_id := 4
type := response
sedes := Container(total: uint8, enrs: List[byte_list, max_length=32])
```
* `total`: The total number of ENR records being returned.
* `enrs`: List of bytestrings, each of which is an RLP encoded ENR record.
* Individual ENR records **MUST** correspond to one of the requested distances.
* It is invalid to return multiple ENR records for the same `node_id`.
> Note: If the number of ENR records cannot be encoded into a single message, then they should be sent back using multiple messages, with the `total` field representing the total number of messages that are being sent.
### Advertise (0x05)
Send advertisements that the peer is expected to be interested in.
```
message_id := 5
type := request
sedes := List[advertisement, max_length=32]
advertisement := Container(
    content_key: byte_list,
    hash_tree_root: bytes32,
    expires_at: uint40,
    signature_v: uint8,
    signature_r: uint256,
    signature_s: uint256,
)
```
### Acknowledge (0x06)
Response message to Advertise(0x05), acknowledging receipt of the advertisements.
```
message_id := 6
type := response
sedes := Container(advertisement_radius: uint256)
```
### Locate
### Locations
### Retrieve
Request a specific piece of content.
### Data
## Content Key Schema
### Headers
### Uncles
### Transactions
### Receipts
### Canonical Indices