# WS sync in practice

Many consensus-layer clients have recently integrated weak subjectivity (WS) sync. This is a major win, not only for the super fast UX, but also because it is critical to proof-of-stake sync security. That is, it is not safe to sync from genesis without also having a recent piece of information about the network (e.g. a finalized hash, block, or state). And because Beacon States are relatively small and bounded in size (e.g. ~40MB), it is quite simple to just get the full beacon state out of band rather than relying on a complex p2p state sync protocol.

> Note 1: This sync mechanism is only critical when initially bootstrapping onto the network *or* when your node has been offline for too long (e.g. 1+ month).

> Note 2: This sync is distinctly different from syncing the ~60GB mainnet Ethereum application state. (1) Once the consensus layer (either PoS or PoW) is verified, you can safely download that state from peers, and (2) just downloading that larger state from *somewhere* is not simple; instead, getting it from many peers over the course of days is a much more reasonable solution.

Okay, so we need a recent state (or at least a hash) to join the network safely, and if we get that recent state out of band, it's quite nice because our node starts running in a [matter of minutes](https://twitter.com/ajsutton/status/1441576184634953732) instead of days. But how do we get a state? Is this dangerous? What can go wrong? And how can we make it better?

[toc]

### How to WS sync

The most cypherpunk strategy to get a state out-of-band is to find a friend (or three) that you know runs an Ethereum mainnet node, get a state dump, load it into your machine, and be on your merry way. Unfortunately, this likely won't work for most users because (1) they don't have any friends or (2) this is just difficult, confusing, or not automated enough.

While we can't solve for (1), we can and have begun to attempt to solve for (2). That is, let's instead create a URI standard to download recent finalized states from a third party. The idea is that if I trust that third party (enough), I can bootstrap safely onto the network. In theory this is nice, but in practice there are probably just a few entities (e.g. infura, etherscan, etc) we might download from.

A few questions:
1. Can we do better?
2. What sort of damage can a security breach with the centralized entity do?
3. Are there advanced heuristics/techniques we can employ?

### Can we do better?

#### N participant checkpoint download

The strawman proposal above relies upon asking a single entity for a recent state from which to bootstrap the network. The obvious method to reduce both liveness and safety risks is to rely not upon one entity but instead to query N entities. In the event that any one entity (or some threshold) disagrees, throw an error to the user that a checkpoint state cannot be safely downloaded automatically and that the user needs to investigate and manually intervene.

To make the above better with respect to requisite bandwidth, have WS checkpoint providers respond to two endpoints -- `checkpoint_root(epoch)` and `checkpoint_state(epoch)`. Query N participants for `checkpoint_root`, and if they agree, download the state once via `checkpoint_state`. Then compute `hash_tree_root(checkpoint_state)` to ensure that it is in fact the root previously multi-queried. Additionally, agreeing on intervals of states to serve will help providers -- e.g. only serve epochs where `epoch % 256 == 0`.
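To make that flow concrete, here is a minimal sketch. The provider URLs, endpoint paths, and the `hash_tree_root` helper are all assumptions (placeholders for whatever API a provider actually exposes and for an SSZ library), not a standardized interface:

```python
import urllib.request

# Hypothetical provider endpoints -- placeholders, not a standardized API.
PROVIDERS = [
    "https://checkpoint-a.example",
    "https://checkpoint-b.example",
    "https://checkpoint-c.example",
]

def fetch(url: str) -> bytes:
    """GET a URL and return the raw response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def download_checkpoint_state(epoch: int, hash_tree_root) -> bytes:
    """Multi-query the checkpoint root, download the state once, then verify it.

    `hash_tree_root` is assumed to come from an SSZ library: it should deserialize
    the SSZ-encoded BeaconState bytes and return the state's 32-byte root.
    """
    # 1. Ask every provider for the checkpoint root at `epoch` (cheap: 32 bytes each).
    roots = {fetch(f"{p}/checkpoint_root/{epoch}") for p in PROVIDERS}

    # 2. Any disagreement -> fail loudly and require manual intervention.
    if len(roots) != 1:
        raise SystemExit(f"providers disagree on the checkpoint root at epoch {epoch}")
    agreed_root = roots.pop()

    # 3. Download the full (~40MB) state from a single provider.
    state_bytes = fetch(f"{PROVIDERS[0]}/checkpoint_state/{epoch}")

    # 4. Verify the downloaded state actually matches the multi-queried root.
    if hash_tree_root(state_bytes) != agreed_root:
        raise SystemExit("downloaded state does not match the agreed checkpoint root")

    return state_bytes
```

The asymmetry is the point: the 32-byte root is cheap enough to query from every provider, while the heavy state download happens only once and is checked against the agreed root afterwards.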
#### NxN bootstrapping

The key to the above is that the user double- (triple- or N-tuple-) checks a WS checkpoint against multiple sources for social consensus before bootstrapping onto the network. Because any one source of a recent checkpoint might be corrupted, the general rule we can follow is to query more **methods** and many **sources** within each method (thus NxN) to increase the likelihood that our social/subjective view of the network is correct. In the event that a certain threshold of methods, or of sources within a method, disagree, alert the user that automated bootstrap methods failed and that they need to investigate and/or manually override (*read: check twitter and figure out what is going on*).

Increasing the number of sources per method is generally straightforward, but what are the methods at our disposal?

* Manual input (e.g. I got a root from a twitter account or friend I trust)
* Checkpoint providers
* Bootstrap nodes
* Nodes in general (discovered beyond bootstrap nodes)

None of the above should be considered necessarily more accurate than another; instead, the methods should be considered additive in an aggregate subjective view of the network the user is intending to join.

Bootstrapping via weak subjectivity can then ideally look something like:

1. [Bonus] Insert a root from a friend via the command line `--ws 0x{ROOT}:{EPOCH}`
2. Query 5 different WS checkpoint providers via `checkpoint_root(recent_epoch)`. Check that these roots match each other (and, if [1] was performed, that they match the CLI root). Abort if the match fails. Download the state via `checkpoint_state(recent_epoch)` on success.
3. The software connects to the p2p network via a hardcoded set of bootnodes. It queries the bootnodes for the root at the `recent_epoch` from [1] and [2]. If any mismatch, abort. Else, continue.
4. The software connects to more nodes discovered from the original bootnodes. It queries these less-trusted nodes for the root at `recent_epoch`. If some threshold mismatches, abort.

Note, as we get to a less restricted set to query, we need to be careful because the view from the network can be Sybil'd. [4] might surface as a warning that is easy to bypass rather than a strict abort. (A rough sketch of how these observations might be aggregated follows the liveness-failure discussion below.)

*Note*: It has come to my attention that bootstrap nodes often have full peer slots and thus cannot be connected to over libp2p; additionally, some bootstrap nodes are not in fact full nodes and are just discv5 services. (1) Many are full nodes, so that's probably okay, (2) clients should really consider a different target and max peer config so that nodes don't get rigid with their peers (e.g. max 60, target 50: accept inbound above 50 and prune down to 50 over time), and (3) we could also put an optional field in the ENR, `wsr` (weak-subjectivity-root), which is just the first 2 bytes of the most recent `epoch % 256` checkpoint root.

### What sort of damage can occur?

There are two primary failures that can happen if WS checkpoint sync is attacked -- liveness and safety.

#### Liveness failures

A liveness failure is one in which the method to bootstrap the WS checkpoint fails to produce a trusted output, and thus your node cannot (at least temporarily) sync to mainnet.

Liveness failures are bad, but likely will not result in loss of (significant) amounts of funds. For normal users, there is opportunity cost in not being able to join the network quickly, and for validators, the validator might be subject to offline penalties during the time in which they cannot manage to sync to the network.
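Tying the NxN steps above to these failure modes, below is a hedged sketch of how a client might aggregate root observations from the different methods, deliberately failing closed (a liveness failure) rather than proceeding on a disputed root. The function name, threshold, and method grouping are illustrative assumptions, not drawn from any client:

```python
from typing import Optional

# Illustrative threshold only -- not taken from any client implementation.
PEER_AGREEMENT_THRESHOLD = 0.9  # discovered peers can be Sybil'd, so treat this check as a warning

def aggregate_checkpoint_roots(
    cli_root: Optional[bytes],    # step 1: --ws root supplied by the user/a friend
    provider_roots: list[bytes],  # step 2: N checkpoint providers
    bootnode_roots: list[bytes],  # step 3: hardcoded bootnodes
    peer_roots: list[bytes],      # step 4: less-trusted peers discovered beyond bootnodes
) -> bytes:
    """Return an agreed checkpoint root, or exit loudly so the user can investigate."""
    trusted = provider_roots + bootnode_roots
    if cli_root is not None:
        trusted.append(cli_root)

    # The more-trusted methods must be unanimous; a mismatch is surfaced as a
    # deliberate liveness failure rather than risking a safety failure.
    if len(set(trusted)) != 1:
        raise SystemExit("WS checkpoint sources disagree -- manual intervention required")
    root = trusted[0]

    # Discovered peers are additive evidence only: a low match rate surfaces as a
    # warning (easy to bypass) because this view of the network can be Sybil'd.
    if peer_roots:
        match_rate = sum(r == root for r in peer_roots) / len(peer_roots)
        if match_rate < PEER_AGREEMENT_THRESHOLD:
            print(f"warning: only {match_rate:.0%} of discovered peers report the same checkpoint root")

    return root
```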
#### Safety failures

A safety failure is one in which the method to bootstrap the WS checkpoint produces an *incorrect* output -- that is, a checkpoint state that is *not* on the canonical mainnet chain. These failures can be much more serious, as they can cause a warped view of the state of the chain upon which the user makes financial decisions.

Fortunately, a validator in isolation (i.e. not part of a coalition of 1/3 by stake weight) cannot get "stuck" on this nefarious chain. If the validator (1) retains their anti-slashing DB and then (2) manages to sync the canonical chain, the validator will safely be able to rejoin mainnet and continue validation, having only suffered small offline penalties while on the "incorrect" chain.

For a user who may be sending arbitrary TXs based on their perceived state of the chain, the range of what can go wrong can in some scenarios be much more dire. I won't enumerate all of the potential scenarios, but a sufficiently sophisticated attacker who has tricked a user into a false WS checkpoint can do a lot of damage if the user is also tricked into signing messages which may have a very different effect on mainnet. In practice, both tricking a user into an incorrect WS checkpoint *and* convincing that same user to send particular TXs based on that state may be far-fetched in many scenarios, but it is nonetheless scary and impactful when successful.

#### Prefer safety over liveness in initial sync

Because a WS safety failure has theoretically unbounded impact on a user, WS sync protocols should be designed with many safety checks (see the NxN methods) and should prefer to fail fast and loud, requiring manual intervention, rather than silently attempting to patch themselves.

### Advanced techniques/heuristics

Are there more advanced techniques and heuristics to further reduce the chance of a safety failure in WS sync? The answer is yes, probably. But the follow-up question is: are they worth the complexity?

To expound upon the potential domain, one such advanced technique might be to do a chain analysis of user activity to assess whether it is sufficiently "organic". The idea is that an attacker's chain may be less diverse/novel/etc than mainnet proper. So maybe we can employ analytic techniques on mainnet history to then assess whether a freshly sync'd chain seems correct. If sketchy, abort.

Such an idea seems sound in principle, but (1) with sufficient NxN bootstrapping, is it worth the complexity, and (2) once you have such an algorithm written down, what are the chances the attacker can game it? Because of (2), any such advanced technique (similar to NxN bootstrapping) becomes additive rather than sufficient alone. Thus, although it might be worthwhile to explore these techniques, we do not expect them to replace NxN bootstrapping outright.