# Everything You Never Wanted To Know About Serialization
I've been researching and thinking about serialization for a few weeks now. This document tries to:
* Capture what I've learned.
* Suggest reasonable courses of action.
## Problem Statement
When talking about serialization needs, we will be looking at the following two primary use cases.
- **Networking**: The act of transporting objects between clients across a network.
- **Consensus**: The process of manipulating objects within the logic of the protocol.
We wish to determine what scheme is appropriate for these use cases in the
context of the Eth2.0 network, with consideration taken towards the Eth1.0
network.
The Eth1.0 protocol uses RLP for both of these use cases.
The Eth2.0 protocol uses SSZ for the *Consensus* use case and does not have a defined scheme for the *Networking* use case.
## Evaluation Criteria
In order to make a decision, we must know the different criteria under
which a serialization scheme should be evaluated.
For the *Networking* use case I pose the following:
TODO: Schemaless at the networking level
- **Major**:
    - **compactness**: How space-efficient is the serialized (byte) representation of the data?
    - **expressiveness**: Does the serialization support the data types we use?
- **Minor**:
    - **streamability**: Does the serialization support streaming encoding and decoding?
For the *Consensus* use case I pose the following criteria.
- **Major**:
    - **expressiveness**: Does the serialization support the data types we use?
    - **hashing**: Can data structures be efficiently hashed and re-hashed after minor modifications?
- **Minor**:
    - **indexing**: The act of accessing the inner values of a data structure without fully deserializing it.
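To make the *hashing* criterion concrete: tree-structured (Merkle) hashing means that modifying one leaf only requires re-hashing the `log2(N)` nodes on its path to the root, rather than re-hashing the entire serialized object. A minimal sketch (sha256 and the tree shape here are illustrative choices, not the SSZ tree-hash spec):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Root of a binary Merkle tree; assumes a power-of-two number of leaves."""
    nodes = [h(leaf) for leaf in leaves]
    while len(nodes) > 1:
        # each pass halves the node count by hashing adjacent pairs
        nodes = [h(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]
```

With the intermediate hashes cached, changing one of `N` leaves touches only one node per level, which is the performance property the *hashing* criterion is after.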
### On "Expressiveness" of data types
This document works under the assumption that we **require** support for the following data types.
- Booleans
- Unsigned Integers:
    - 64 bits for the beacon chain
    - 128 bits for the shard chains
    - 256 bits for the eth1.0 chain
- Fixed length byte strings.
- Dynamic length byte strings.
- Homogeneous arrays.
- Heterogeneous struct-likes.
It's worth noting the following nuances about data types.
- Support for fixed-length byte strings allows:
    - adding support for unsigned integers of any size
    - adding support for booleans
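For example (a sketch; the names and the big-endian byte order are my choices, not taken from any of the specs below), both reductions are one-liners on top of a fixed-length byte-string primitive:

```python
def uint_to_fixed_bytes(value: int, num_bits: int) -> bytes:
    """Represent an unsigned integer as a fixed-length byte string."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value out of range for the given bit width")
    return value.to_bytes(num_bits // 8, "big")

def bool_to_fixed_bytes(value: bool) -> bytes:
    """Represent a boolean as a 1-byte fixed-length byte string."""
    return b"\x01" if value else b"\x00"
```

A scheme that natively handles fixed-length byte strings can therefore cover booleans and arbitrary-width unsigned integers at a second layer with no extra wire format.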
### On "Indexing"
A full explanation of this can be found here: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
An optimal serialization scheme would allow:
- `O(1)` lookups for elements which are statically sized and whose outer data
  containers are also statically sized.
- `O(log(N))` lookups for elements contained within dynamically sized data
containers.
## Candidates
Here is the evaluation of various serialization schemes I've looked at.
- Protobuf: https://developers.google.com/protocol-buffers/
- MessagePack: https://msgpack.org/
- CBOR: http://cbor.io/
- BSON: TODO: http://bsonspec.org/spec.html
- RLP: https://github.com/ethereum/wiki/wiki/RLP
- SSZ: https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md
- SOS: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
- SSS: https://github.com/ethereum/bimini/blob/7c26efec585742ef870bf58ea5d96e2deb242775/spec.md
### Some popular formats that don't suit our needs
Protobuf, MessagePack, and CBOR suffer from the same shortfalls:
- No support for integers above 64 bits
- No support for fixed size byte strings
- No support for fixed size arrays
- The *hashing* requirement for eth2.0 would have to be implemented at a second layer.
> CBOR does support arbitrary bignums, but no defined bit sizes above 64 bits.
> CBOR contains extraneous metadata in its serialization scheme that isn't needed by our protocol.
### RLP
https://github.com/ethereum/wiki/wiki/RLP
#### Compactness
RLP can be efficient under the condition that the data structure has relatively few elements. The *extraneous* bytes come from:
- the need for additional length bytes on fixed-length data structures:
    - fixed-length byte strings
    - fixed-size integers
    - lists of known length
    - structs
Experiments show that this inefficiency increases serialized sizes by roughly 5% for most Eth1.0 data structures, but by as much as 40% for Eth2.0 data structures.
#### Expressiveness
RLP only has first-class support for dynamic length byte strings and dynamically
sized lists of such byte strings. All of the additional data types
are supported via additional abstraction layers provided by the various RLP
libraries.
This allows RLP to support arbitrary data types, but at the cost of a few
extra bytes for length prefixes.
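A sketch of how such a second layer typically works (helper names are mine; only the string forms of RLP are shown): an unsigned integer is stripped to its minimal big-endian bytes and then handed to RLP as a plain byte string, which RLP length-prefixes.

```python
def rlp_encode_bytes(payload: bytes) -> bytes:
    """RLP byte-string encoding (short and long string forms)."""
    if len(payload) == 1 and payload[0] < 0x80:
        return payload  # a single byte below 0x80 encodes as itself
    elif len(payload) < 56:
        return bytes([0x80 + len(payload)]) + payload  # 1-byte length prefix
    else:
        length = len(payload).to_bytes((len(payload).bit_length() + 7) // 8, "big")
        return bytes([0xB7 + len(length)]) + length + payload

def rlp_encode_uint(value: int) -> bytes:
    """Second-layer convention: minimal big-endian bytes, then RLP-encode."""
    minimal = value.to_bytes((value.bit_length() + 7) // 8, "big") if value else b""
    return rlp_encode_bytes(minimal)
```

Note how small integers cost at most one extra prefix byte, which is where the "few extra bytes" above come from.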
#### Hashing
The current approach to hashing RLP objects does not provide the performance
gains needed for the Eth2.0 network.
However, it should be easy to define an algorithm similar to [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
that works well for RLP objects.
#### Indexing
RLP does not lend itself well to fast indexing into encoded data structures;
lookups can approach `O(N)` for long, flat data structures.
### SSZ
https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md
#### Compactness
SSZ is not space-efficient. This inefficiency seems to be largely due to the 4-byte
length prefixes it uses for dynamically sized data structures, as well as the length
prefixes it uses for containers, which are not needed.
Measurements of Eth1.0 data structures showed SSZ serialized values being
30-50% larger than their RLP equivalents.
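To make the overhead concrete, here is a sketch of the fixed 4-byte length-prefix pattern (the function name and the byte order shown are my own illustration, not a claim about the SSZ spec's exact encoding):

```python
def encode_dynamic_bytes(payload: bytes) -> bytes:
    """Prefix a dynamic byte string with a fixed 4-byte length."""
    return len(payload).to_bytes(4, "little") + payload
```

A 3-byte payload carries 4 bytes of prefix overhead here, where RLP would spend a single prefix byte; repeated across the many small fields in a block, that difference compounds into the measured gap.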
#### Expressiveness
SSZ supports all of the needed data types.
#### Hashing
SSZ defines [trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
which allows for efficient re-hashing of objects with minor modifications.
#### Indexing
SSZ allows for reasonably fast indexing. It contains some inefficiency here which my intuition says... *TODO*
### SOS
https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
The SOS scheme has not been deeply researched, nor have firm numbers been
measured to compare it to RLP or SSZ, so the following is based largely on
intuition.
#### Compactness
SOS should be on par with, or slightly more efficient than, SSZ.
The inefficiency of SOS should be attributable to the 4-byte offset indexes it
uses to reference the locations of the dynamic elements.
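A sketch of the offset idea for a container with two dynamic fields (the function and the 4-byte little-endian layout are my own illustration of the scheme described in the gist):

```python
def encode_two_dynamic_fields(a: bytes, b: bytes) -> bytes:
    """Offset-style layout: a fixed-size offset table, then the dynamic data.

    Each 4-byte entry records where its field's data begins, so reading
    field i means reading offsets i and i+1 -- no scan of earlier fields.
    """
    table_size = 2 * 4  # two fields, 4 bytes per offset
    offset_a = table_size
    offset_b = table_size + len(a)
    return (offset_a.to_bytes(4, "little")
            + offset_b.to_bytes(4, "little")
            + a + b)
```

The fixed-size offset table is what buys the fast indexing discussed below, and it is also the source of the 4 bytes of overhead per dynamic element.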
#### Expressiveness
SOS supports all of the needed data types.
#### Hashing
SOS does not define a fast hashing scheme, but [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
should be easily converted to work with SOS.
#### Indexing
SOS provides reliably fast indexing.
### SSS
https://github.com/ethereum/bimini/blob/7c26efec585742ef870bf58ea5d96e2deb242775/spec.md
This is an experimental scheme that I came up with.
#### Compactness
SSS is the most compact of all of the schemes here. This is primarily achieved
through the use of
[LEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128) encoding of
integer values that are known to normally be well under their maximum bit
width.
SSS is able to gain roughly:
- 5% efficiency vs RLP
- 30-50% efficiency vs SSZ
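A minimal sketch of the unsigned LEB128 encoding SSS relies on: each byte carries 7 payload bits, with the high bit marking that more bytes follow, so values that are usually small take one byte regardless of their declared width.

```python
def leb128_encode_unsigned(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte; the high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode_unsigned(data: bytes) -> int:
    """Inverse: accumulate 7-bit groups, least significant group first."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result
```

A slot number or validator index that fits in one or two bytes thus no longer pays for the full 8 bytes a fixed `uint64` encoding would require.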
#### Expressiveness
SSS supports all of the needed data types.
#### Hashing
SSS does not define a fast hashing scheme, but [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
should be easily converted to work with SSS.
#### Indexing
SSS does not provide fast indexing.
## Where we go from here
Here is a table that summarizes the findings above.
| | compact | expressiveness | hashing | indexing |
|--------|----------|----------------|-----------|-----------|
| RLP | yes | flexible | possible | no |
| SSZ | no | yes | yes | poor |
| SOS | no | yes | possible | yes |
| SSS | yes | yes | possible | no |
### Option A: Two Serialization Formats
One option is for us to use two formats.
- RLP for *Networking*
- SSZ for *Consensus*
Using two different serialization schemes is beneficial because *indexing* and
*compactness* appear to be somewhat mutually exclusive, and the existing schemes
that provide fast *indexing* preclude support for streaming encoding and
decoding.
We can improve this approach by bringing the offset format from SOS into SSZ.
We can gain some efficiency at the networking level by modifying our use of RLP
to include support for LEB128 encoded integers.
### Option B: RLP only
The simplest option is to define a trie hash for RLP and accept the loss of fast
indexing at the *Consensus* level and of streamability when encoding and decoding.
We get a single, established serialization scheme which is flexible and well known
and serves all of our *major* requirements adequately.
### Option C: Hybrid SSZ/SOS/SSS
As it stands, picking any of the proposed options would mean making a
meaningful sacrifice in at least one of the evaluation criteria. I believe,
however, that we *can* combine these formats into a single format which serves
all of our needs.
I propose the following high-level approach:
- Trie Hashing from SSZ
- Indexing from SOS
- Efficiency from SSS
This could be a *new* scheme or a modification to SSZ. Rough details of how this would work follow.
We split encoding into two concepts.
- **value** encoding:
    - The encoded format without any metadata such as length prefixes.
- **structure** encoding:
    - The manner in which the structure is encoded, including both length prefixes and the layout of the encoded values.
Everything would have a single scheme for *value* encodings.
For *structure* encoding, we would define two schemes.
- **ABI**: The highly efficient SOS style layout.
    - Dynamic data referenced by offset and located at the end
    - Fixed size length prefixes (if applicable)
    - LEB128 encoded values get padded to full byte width
- **WIRE**: The highly compact SSS style layout.
    - LEB128 length prefixes
    - Dynamic data encoded inline
The *Trie Hash* algorithm would be defined only for the *ABI* layout.
## Conclusion
**Option A** is *good* because it takes the least work to implement; however, it has the *downside* of leaving us with two different schemes.
**Option B** leverages existing infrastructure, requires some non-trivial changes to the Eth2.0 spec, and requires us to make sacrifices on many of the evaluation criteria.
**Option C** builds on the current Eth2.0 spec but with non-trivial changes should result in a single serialization scheme that is highly optimal across all criteria.
My personal preferences in order of most preferred to least are:
- Option C
- Option A
- Option B
## On Compression
One question which has come up is roughly:
> Can we choose a *single* serialization format that compromises on *compactness* (the serialized format is not *highly compact*) and make up those losses using compression (e.g. snappy).
My experiments suggest that we *can* get improvements from using compression, but that SSS style serialization still results in a roughly 8% reduction in final size.
- SSS+snappy vs SSZ+snappy:
    - 8% size reduction for attestations
    - 9.5% size reduction for beacon blocks
    - 4% size reduction for beacon headers
## Lots of raw data
Here is the latest raw data comparing RLP, SSZ, and SSS:
https://gist.github.com/pipermerriam/449b0425a96197a731ed41f8f77ed42f