# Everything You Never Wanted To Know About Serialization
I've been researching and thinking about serialization for a few weeks now. This document tries to:
* Capture what I've learned.
* Suggest reasonable courses of action.
## Problem Statement
When talking about serialization needs, we will be looking at the following two primary use cases.
- **Networking**: The act of transporting objects between clients across a network.
- **Consensus**: The process of manipulating objects within the logic of the protocol.
We wish to determine what scheme is appropriate for these use cases in the
context of the Eth2.0 network, with consideration taken towards the Eth1.0
network.
The Eth1.0 protocol uses RLP for both of these use cases.
The Eth2.0 protocol uses SSZ for the *Consensus* use case and does not have a defined scheme for the *Networking* use case.
## Evaluation Criteria
In order to make a decision, we must know the different criteria under
which a serialization scheme should be evaluated.
For the *Networking* use case I pose the following:
TODO: Schemaless at the networking level
- **Major**:
    - **compactness**: How space-efficient is the serialized (byte) representation of the data?
    - **expressiveness**: Does the serialization support the data types we use?
- **Minor**:
    - **streamability**: Does the serialization support streaming encoding and decoding?
For the *Consensus* use case I pose the following criteria.
- **Major**:
    - **expressiveness**: Does the serialization support the data types we use?
    - **hashing**: Can data structures be efficiently hashed and re-hashed after minor modifications?
- **Minor**:
    - **indexing**: The act of accessing the inner values of a data structure without fully deserializing it.
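To make the *hashing* criterion concrete: tree-structured (Merkle) hashing means that modifying one leaf only requires re-hashing the `log2(N)` nodes on its path to the root, rather than re-hashing the entire serialized object. A minimal sketch (sha256 and the tree shape here are illustrative choices, not the SSZ tree-hash spec):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Root of a binary Merkle tree; assumes a power-of-two number of leaves."""
    nodes = [h(leaf) for leaf in leaves]
    while len(nodes) > 1:
        # each pass halves the node count by hashing adjacent pairs
        nodes = [h(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]
```

With the intermediate hashes cached, changing one of `N` leaves touches only one node per level, which is the performance property the *hashing* criterion is after.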
### On "Expressiveness" of data types
This document works under the assumption that we **require** support for the following data types.
- Booleans
- Unsigned Integers:
    - 64 bits for the beacon chain
    - 128 bits for the shard chains
    - 256 bits for the eth1.0 chain
- Fixed length byte strings.
- Dynamic length byte strings.
- Homogeneous arrays.
- Heterogeneous struct-likes.
It's worth noting the following nuances about data types.
- Support for fixed-length byte strings allows:
    - adding support for unsigned integers of any size
    - adding support for booleans
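For example (a sketch; the names and the big-endian byte order are my choices, not taken from any of the specs below), both reductions are one-liners on top of a fixed-length byte-string primitive:

```python
def uint_to_fixed_bytes(value: int, num_bits: int) -> bytes:
    """Represent an unsigned integer as a fixed-length byte string."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value out of range for the given bit width")
    return value.to_bytes(num_bits // 8, "big")

def bool_to_fixed_bytes(value: bool) -> bytes:
    """Represent a boolean as a 1-byte fixed-length byte string."""
    return b"\x01" if value else b"\x00"
```

A scheme that natively handles fixed-length byte strings can therefore cover booleans and arbitrary-width unsigned integers at a second layer with no extra wire format.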
### On "Indexing"
A full explanation of this can be found here: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
An optimal serialization scheme would allow:
- `O(1)` lookups for elements which are statically sized and whose outer data
  containers are also statically sized.
- `O(log(N))` lookups for elements contained within dynamically sized data
containers.
## Candidates
Here is the evaluation of various serialization schemes I've looked at.
- Protobuf: https://developers.google.com/protocol-buffers/
- MessagePack: https://msgpack.org/
- CBOR: http://cbor.io/
- BSON: TODO: http://bsonspec.org/spec.html
- RLP: https://github.com/ethereum/wiki/wiki/RLP
- SSZ: https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md
- SOS: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
- SSS: https://github.com/ethereum/bimini/blob/7c26efec585742ef870bf58ea5d96e2deb242775/spec.md
### Some popular formats that don't suit our needs
Protobuf, MessagePack, and CBOR suffer from the same shortfalls:
- No support for integers above 64 bits
- No support for fixed size byte strings
- No support for fixed size arrays
- The *hashing* requirement for eth2.0 would have to be implemented at a second layer.
> CBOR does support arbitrary bignums, but no defined bit sizes above 64 bits.
> CBOR contains extraneous metadata in its serialization scheme that isn't needed by our protocol.
### RLP
https://github.com/ethereum/wiki/wiki/RLP
#### Compactness
RLP can be efficient under the condition that the data structure has relatively few elements. The *extraneous* bytes come from:
- the need for additional length bytes on fixed-length data structures:
    - fixed-length byte strings
    - fixed-size integers
    - lists of known length
    - structs
Experiments show that this inefficiency increases serialized sizes by roughly 5% for most Eth1.0 data structures, but by as much as 40% for Eth2.0 data structures.
#### Expressiveness
RLP only has first-class support for dynamic length byte strings and dynamically
sized lists of such byte strings. All of the additional data types
are supported via additional abstraction layers provided by the various RLP
libraries.
This allows RLP to support arbitrary data types, but at the cost of a few
extra bytes for length prefixes.
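A sketch of how such a second layer typically works (helper names are mine; only the string forms of RLP are shown): an unsigned integer is stripped to its minimal big-endian bytes and then handed to RLP as a plain byte string, which RLP length-prefixes.

```python
def rlp_encode_bytes(payload: bytes) -> bytes:
    """RLP byte-string encoding (short and long string forms)."""
    if len(payload) == 1 and payload[0] < 0x80:
        return payload  # a single byte below 0x80 encodes as itself
    elif len(payload) < 56:
        return bytes([0x80 + len(payload)]) + payload  # 1-byte length prefix
    else:
        length = len(payload).to_bytes((len(payload).bit_length() + 7) // 8, "big")
        return bytes([0xB7 + len(length)]) + length + payload

def rlp_encode_uint(value: int) -> bytes:
    """Second-layer convention: minimal big-endian bytes, then RLP-encode."""
    minimal = value.to_bytes((value.bit_length() + 7) // 8, "big") if value else b""
    return rlp_encode_bytes(minimal)
```

Note how small integers cost at most one extra prefix byte, which is where the "few extra bytes" above come from.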
#### Hashing
The current approach to hashing RLP objects does not provide the performance
gains needed for the Eth2.0 network.
However, it should be easy to define an algorithm similar to [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
that works well for RLP objects.
#### Indexing
RLP does not lend itself well to fast indexing into encoded data structures;
lookups can approach `O(N)` for long, flat data structures.
### SSZ
https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md
#### Compactness
SSZ is not space-efficient. This inefficiency seems to be largely due to the 4-byte
length prefixes it uses for dynamically sized data structures, as well as the length
prefixes it uses for containers, which are not needed.
Measurements of Eth1.0 data structures showed SSZ serialized values being
30-50% larger than their RLP equivalents.
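To make the overhead concrete, here is a sketch of the fixed 4-byte length-prefix pattern (the function name and the byte order shown are my own illustration, not a claim about the SSZ spec's exact encoding):

```python
def encode_dynamic_bytes(payload: bytes) -> bytes:
    """Prefix a dynamic byte string with a fixed 4-byte length."""
    return len(payload).to_bytes(4, "little") + payload
```

A 3-byte payload carries 4 bytes of prefix overhead here, where RLP would spend a single prefix byte; repeated across the many small fields in a block, that difference compounds into the measured gap.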
#### Expressiveness
SSZ supports all of the needed data types.
#### Hashing
SSZ defines [trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
which allows for efficient re-hashing of objects with minor modifications.
#### Indexing
SSZ allows for reasonably fast indexing. It contains some inefficiency here which my intuition says... *TODO*
### SOS
https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
The SOS scheme has not been deeply researched, nor have firm numbers been
measured to compare it to RLP or SSZ, so the following is based largely on
intuition.
#### Compactness
SOS should be on par with, or slightly more efficient than, SSZ.
The inefficiency of SOS should be attributable to the 4-byte offset indexes it
uses to reference the locations of the dynamic elements.
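A sketch of the offset idea for a container with two dynamic fields (the function and the 4-byte little-endian layout are my own illustration of the scheme described in the gist):

```python
def encode_two_dynamic_fields(a: bytes, b: bytes) -> bytes:
    """Offset-style layout: a fixed-size offset table, then the dynamic data.

    Each 4-byte entry records where its field's data begins, so reading
    field i means reading offsets i and i+1 -- no scan of earlier fields.
    """
    table_size = 2 * 4  # two fields, 4 bytes per offset
    offset_a = table_size
    offset_b = table_size + len(a)
    return (offset_a.to_bytes(4, "little")
            + offset_b.to_bytes(4, "little")
            + a + b)
```

The fixed-size offset table is what buys the fast indexing discussed below, and it is also the source of the 4 bytes of overhead per dynamic element.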
#### Expressiveness
SOS supports all of the needed data types.
#### Hashing
SOS does not define a fast hashing scheme, but [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
should be easily converted to work with SOS.
#### Indexing
SOS provides reliably fast indexing.
### SSS
https://github.com/ethereum/bimini/blob/7c26efec585742ef870bf58ea5d96e2deb242775/spec.md
This is an experimental scheme that I came up with.
#### Compactness
SSS is the most compact of all of the schemes here. This is primarily achieved
through the use of
[LEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128) encoding of
integer values that are known to normally be well under their maximum bit
width.
SSS is able to gain roughly:
- 5% efficiency vs RLP
- 30-50% efficiency vs SSZ
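A minimal sketch of the unsigned LEB128 encoding SSS relies on: each byte carries 7 payload bits, with the high bit marking that more bytes follow, so values that are usually small take one byte regardless of their declared width.

```python
def leb128_encode_unsigned(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte; the high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode_unsigned(data: bytes) -> int:
    """Inverse: accumulate 7-bit groups, least significant group first."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result
```

A slot number or validator index that fits in one or two bytes thus no longer pays for the full 8 bytes a fixed `uint64` encoding would require.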
#### Expressiveness
SSS supports all of the needed data types.
#### Hashing
SSS does not define a fast hashing scheme, but [SSZ trie
hash](https://github.com/ethereum/eth2.0-specs/blob/bed888810d5c99cd114adc9907c16268a2a285a9/specs/simple-serialize.md#tree-hash)
should be easily converted to work with SSS.
#### Indexing
SSS does not provide fast indexing.
## Where we go from here
Here is a table that summarizes the findings above.
| | compact | expressiveness | hashing | indexing |
|--------|----------|----------------|-----------|-----------|
| RLP | yes | flexible | possible | no |
| SSZ | no | yes | yes | poor |
| SOS | no | yes | possible | yes |
| SSS | yes | yes | possible | no |
### Option A: Two Serialization Formats
One option is for us to use two formats.
- RLP for *Networking*
- SSZ for *Consensus*
Using two different serialization schemes is beneficial because *indexing* and
*compactness* appear to be somewhat mutually exclusive, and the existing schemes
that provide fast *indexing* preclude support for streaming encoding and
decoding.
We can improve this approach by bringing the offset format from SOS into SSZ.
We can gain some efficiency at the networking level by modifying our use of RLP
to include support for LEB128 encoded integers.
### Option B: RLP only
The simplest option is to define a trie hash for RLP and accept the loss of fast
indexing at the *Consensus* level and of streamability when encoding and decoding.
We get a single, established serialization scheme which is flexible and well known
and serves all of our *major* requirements adequately.
### Option C: Hybrid SSZ/SOS/SSS
As it stands, picking any of the proposed options would mean making a
meaningful sacrifice in at least one of the evaluation criteria. I believe,
however, that we *can* combine these formats into a single format which serves
all of our needs.
I propose the following high-level approach:
- Trie Hashing from SSZ
- Indexing from SOS
- Efficiency from SSS
This could be a *new* scheme or a modification to SSZ. Rough details of how this would work follow.
We split encoding into two concepts.
- **value** encoding:
    - The encoded format without any metadata such as length prefixes.
- **structure** encoding:
    - The manner in which the structure is encoded, including both length prefixes and the layout of the encoded values.
Everything would have a single scheme for *value* encodings.
For *structure* encoding, we would define two schemes.
- **ABI**: The highly efficient SOS style layout.
    - Dynamic data referenced by offset and located at the end
    - Fixed size length prefixes (if applicable)
    - LEB128 encoded values get padded to full byte width
- **WIRE**: The highly compact SSS style layout.
    - LEB128 length prefixes
    - Dynamic data encoded inline
The *Trie Hash* algorithm would be defined only for the *ABI* layout.
## Conclusion
**Option A** is *good* because it takes the least work to implement; however, it has the *downside* of leaving us with two different schemes.
**Option B** leverages existing infrastructure, requires some non-trivial changes to the Eth2.0 spec, and requires us to make sacrifices on many of the evaluation criteria.
**Option C** builds on the current Eth2.0 spec but with non-trivial changes should result in a single serialization scheme that is highly optimal across all criteria.
My personal preferences in order of most preferred to least are:
- Option C
- Option A
- Option B
## On Compression
One question which has come up is roughly:
> Can we choose a *single* serialization format that compromises on *compactness* (the serialized format is not *highly compact*) and make up those losses using compression (e.g. snappy).
My experiments suggest that we *can* get improvements from using compression, but that SSS style serialization still results in a roughly 8% reduction in final size.
- SSS+snappy vs SSZ+snappy:
    - 8% size reduction for attestations
    - 9.5% size reduction for beacon blocks
    - 4% size reduction for beacon headers
## Lots of raw data
Here is the latest raw data comparing RLP, SSZ, and SSS:
https://gist.github.com/pipermerriam/449b0425a96197a731ed41f8f77ed42f