This document attempts to explain a proposed mapping from the Ethereum state trie onto the nodes in a DHT network in such a manner that:

- A: locality within each trie is preserved, so that nodes store contiguous sections of the tries.
- B: the stored data is randomly and evenly distributed across the keyspace.
- C: nodes are free to choose how much of the state they store.
- D: any piece of data can be located, retrieved, and verified with O(1) access cost.

Each node in the DHT has a node_id, which we assume are evenly and randomly distributed across the 32 byte keyspace.
For each possible leaf in either the main account trie or an individual contract storage trie, we say it has a content_id which dictates its location within the network. We then define a distance function which determines the proximity between a node_id and a content_id:
MODULO = 2**256
MID = 2**255


def distance(node_id: int, content_id: int) -> int:
    """
    Source: https://stackoverflow.com/questions/28036652/finding-the-shortest-distance-between-two-angles

    A distance function for determining proximity between a node and content.

    Treats the keyspace as if it wraps around on both ends and
    returns the minimum distance needed to traverse between two
    different keys.

    Examples:

    >>> assert distance(10, 10) == 0
    >>> assert distance(5, 2**256 - 1) == 6
    >>> assert distance(2**256 - 1, 6) == 7
    >>> assert distance(5, 1) == 4
    >>> assert distance(1, 5) == 4
    >>> assert distance(0, 2**255) == 2**255
    >>> assert distance(0, 2**255 + 1) == 2**255 - 1
    """
    delta = (node_id - content_id + MID) % MODULO - MID
    if delta < -MID:
        return abs(delta + MID)
    else:
        return abs(delta)
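As a usage sketch (the peer and content identifiers below are arbitrary example values, not real network ids), a client can use this distance function to pick, from a set of known peers, the one expected to be storing a given piece of content:

```python
# Arbitrary example values chosen only for illustration.
content_id = 2**255 + 12345
peer_ids = [10, 2**255, 2**256 - 1]

# The peer whose node_id has the smallest wrap-around distance to the
# content_id is the one expected to be storing that content.
closest_peer = min(peer_ids, key=lambda node_id: distance(node_id, content_id))
assert closest_peer == 2**255
```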
In the main account trie, an account with address A is stored under the key keccak(A). We will refer to this key as account_trie_key = keccak(A). These account_trie_key values are randomly and evenly distributed across the 32 byte keyspace. The content_id for a leaf node in the account trie is computed as content_id = account_trie_key.
Nodes in the network would be responsible for storing a contiguous section of account leaves that are closest to their node_id, along with the accompanying merkle proof data to anchor that data to the latest state root.
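The following is a minimal runnable sketch of this mapping, assuming the eth_hash package is available for keccak256; the helper name get_account_content_id is illustrative and not part of the proposal:

```python
from eth_hash.auto import keccak


def get_account_content_id(address: bytes) -> int:
    # The account trie key is keccak(address); the content_id is that same
    # digest interpreted as a 256-bit integer.
    account_trie_key = keccak(address)
    return int.from_bytes(account_trie_key, "big")
```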
Each contract has its own individual storage trie referenced by Account.storage_root. These tries vary in size depending on the contract. Each piece of data in a contract storage trie has a slot number which is mapped to the key storage_trie_key = keccak(slot). The data in each individual contract storage trie is randomly and evenly distributed across the 32 byte keyspace.

While each contract's storage keys are randomly and evenly distributed across the 32 byte keyspace, commonly used storage slots such as the zero slot map to the same storage_trie_key in every contract, meaning that the set of all contract storage keys across all contracts is not evenly distributed across the keyspace.
The content_id for a leaf node in a contract storage trie for a contract at address A is defined as:
MOD = 2**256

def get_content_id(contract_address: int, slot: int) -> int:
    # keccak is treated here as returning the 32-byte digest as a 256-bit integer.
    account_trie_key = keccak(contract_address)
    storage_trie_key = keccak(slot)
    content_id = (account_trie_key + storage_trie_key) % MOD
    return content_id
By offsetting the storage_trie_key by the account_trie_key, we get the end result that the content_id values for the set of all contract storage items across all contracts are randomly and evenly distributed across the keyspace.
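For concreteness, here is a self-contained variant of the function above, assuming the eth_hash package for keccak256 and the 32-byte big-endian encoding of the slot; the helper name and the contract addresses at the end are made up for illustration. It shows that the zero slot of two different contracts shares a storage_trie_key but lands at two different content_id values:

```python
from eth_hash.auto import keccak

MOD = 2**256


def get_storage_content_id(contract_address: bytes, slot: int) -> int:
    # Both keys are keccak256 digests interpreted as 256-bit integers.
    account_trie_key = int.from_bytes(keccak(contract_address), "big")
    storage_trie_key = int.from_bytes(keccak(slot.to_bytes(32, "big")), "big")
    # Offsetting by the account trie key spreads identical slots of
    # different contracts across the keyspace.
    return (account_trie_key + storage_trie_key) % MOD


# Two made-up 20-byte contract addresses.
contract_a = bytes(range(20))
contract_b = bytes(range(1, 21))

# Slot 0 hashes to the same storage_trie_key for both contracts, but the
# resulting content_id values differ.
assert get_storage_content_id(contract_a, 0) != get_storage_content_id(contract_b, 0)
```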
Each node in the network will store the set of leaves that are closest to their node_id, along with the accompanying proofs anchoring those leaves to the state root.
In practice, this means that individual nodes in the network will end up storing some section of the main account trie and some section of the contract storage tries for a subset of the contracts.
For nodes to manage their stored data, they will need to examine the set of all content_id values for the data they are storing, and prune the leaf data of the furthest content when their storage is full.
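A minimal sketch of such a pruning routine is shown below; it reuses the distance function defined earlier, and the dictionary layout and item-count capacity are assumptions made purely for illustration:

```python
from typing import Dict


def prune_furthest(node_id: int, stored: Dict[int, bytes], capacity: int) -> None:
    """Evict the leaf data furthest from node_id until under capacity.

    stored maps content_id -> encoded leaf data, and capacity is a simple
    item-count limit; a real node would likely measure bytes on disk.
    """
    while len(stored) > capacity:
        furthest = max(stored, key=lambda content_id: distance(node_id, content_id))
        del stored[furthest]
```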
Recall that we want this storage scheme to satisfy conditions A through D outlined above.

We know A is satisfied because both schemes for deriving content_id preserve locality within the individual tries.
For both the main account trie and the contract storage tries, the content_id values are randomly and evenly distributed. We can also infer that the merged set of these values is also randomly and evenly distributed. This satisfies condition B.

We know C is satisfied because nodes are free to choose how much of the tries they wish to store.

We know D is satisfied because for any data that someone wishes to access, they are able to immediately identify which nodes in the network should have that data, request it, and verify it using the accompanying proof.
One requirement is that nodes in the network can make multiple requests across disparate peers for the same state root. This can be accomplished by having nodes store multiple proofs:

For each trie, the node would store a single proof for the latest state root for which they have received data. Each time updated data is received, they would compute a diff against their previous latest HEAD proof, storing a set of reverse diffs that allow them to rewind the latest HEAD proof to a previous state root. Nodes can then choose to discard these reverse diffs beyond some recent time horizon, such as 256 blocks.

Requests for data should come with an accompanying recent state root to which the request is anchored. Nodes would only respond to requests for data within the range of recent state roots they track.
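The sketch below illustrates this bookkeeping under several assumptions: the Proof and ReverseDiff types are opaque placeholders, apply_reverse_diff is a hypothetical helper that rewinds a proof by one step, and the 256-entry horizon mirrors the example above; none of these names are part of the proposal:

```python
from collections import OrderedDict
from typing import Callable, Optional

# Opaque placeholders for illustration; real proofs and diffs would be
# structured trie data.
Proof = object
ReverseDiff = object


class ProofStore:
    def __init__(self, apply_reverse_diff: Callable[[Proof, ReverseDiff], Proof],
                 horizon: int = 256) -> None:
        self.apply_reverse_diff = apply_reverse_diff
        self.horizon = horizon
        self.head_root: Optional[bytes] = None
        self.head_proof: Optional[Proof] = None
        # Maps an older state root to the reverse diff that rewinds the
        # proof for the next newer root back to that older root.
        self.reverse_diffs: "OrderedDict[bytes, ReverseDiff]" = OrderedDict()

    def update_head(self, new_root: bytes, new_proof: Proof,
                    reverse_diff: Optional[ReverseDiff] = None) -> None:
        # reverse_diff rewinds new_proof back to the previous head root.
        if self.head_root is not None and reverse_diff is not None:
            self.reverse_diffs[self.head_root] = reverse_diff
        self.head_root, self.head_proof = new_root, new_proof
        # Discard reverse diffs beyond the recent time horizon.
        while len(self.reverse_diffs) > self.horizon:
            self.reverse_diffs.popitem(last=False)

    def proof_at(self, state_root: bytes) -> Optional[Proof]:
        # Only serve requests anchored to a state root we still track.
        if state_root == self.head_root:
            return self.head_proof
        if state_root not in self.reverse_diffs:
            return None
        proof = self.head_proof
        # Rewind from HEAD through successively older reverse diffs until
        # the requested root is reached.
        for root in reversed(self.reverse_diffs):
            proof = self.apply_reverse_diff(proof, self.reverse_diffs[root])
            if root == state_root:
                return proof
        return None
```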
This storage scheme should not need any special functionality added to support garbage collection of data that is no longer part of the trie.
Pushing data into the network requires generation and distribution of a single proof that contains all leaf data that was touched, modified, or deleted.