# Pectra Testnets Incident Reports

## Besu

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

https://hackmd.io/@siladu/H1qydmWhyx

### What client issues have you encountered during the period of long non-finality?

We were running five CLs (all except Grandine) along with Besu, and all of them had issues at some point. We kept our Besu-Lighthouse node producing blocks while Holesky was suffering from its worst liveness issues.

During recovery, our five nodes became VC-only nodes and we pointed them at various recovery beacon nodes, which were hotfixed and synced separately. Sometimes we used other teams' beacon nodes as well. Keeping the beacon nodes alive has been a struggle, but things have been more stable in the last couple of days. One lesson here is that it is handy to have the VC separated out (our Teku setup was combined). Another lesson is that maintaining five different configurations is time consuming when issues occur. For Holesky (and for Hoodi) we are going to move towards a beacon node setup of two larger nodes, one Teku and one Lighthouse, which seem to have been the most stable clients for us.

We also found and fixed an issue related to Besu snap sync during periods of non-finality: https://github.com/hyperledger/besu/issues/8393

### Any issues you have encountered after finalization on Holešky?

We were running mostly hotfixed Lighthouse beacon nodes at the time, which suffered from an issue upon finality, but it was easy to resync, and that is now also fixed.

Participation has been on the low side since finality, which has made things more unstable than before the incident.

We recently discovered an issue in our infra that was making things more unstable: our VMs were using ntp for time synchronization. Switching to chrony seems to be a notable improvement. Network conditions on Holesky brought this inaccuracy to light; we didn't notice significant issues before.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

Even though it was not deployed, we were able to quickly build a Besu plugin that temporarily bans transactions that could be considered attack vectors targeting the issue, preventing them from being included in proposed blocks. This extends the scope of plugins to deploying quick, temporary hot-fixes without the need for a full release, a use case that had not been considered before.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

Yes, we are reviewing the usage of third-party libraries in critical consensus paths and replacing them with code directly under our control, see https://github.com/hyperledger/besu/issues/8391

Also, as mentioned in https://hackmd.io/@siladu/H1qydmWhyx:

- Don't rely on defaults, fail early instead
- Adopt eth_config https://hackmd.io/@shemnon/eth_config

It would be good to work towards using a shared config between ELs (eth_config might mitigate this anyway).

### Any more thoughts you'd like to share in the general document?

We found that moving validator keys from one validator client to another is not trivial and too manual, unless there are tools around that I could not find. Loading thousands of keys also takes a lot of time, so we re-encrypted them with lower security using https://github.com/usmansaleem/v4keystore_converter

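For context on why that helps, here is a minimal sketch (Python, illustrative iteration counts only; it is not what v4keystore_converter actually does internally) showing that the password-based KDF dominates the cost of opening EIP-2335 keystores, so lowering its work factor makes bulk key loading much faster at the cost of weaker password protection:

```python
# Illustrative only: the password-based KDF dominates EIP-2335 keystore
# decryption, so lowering its work factor speeds up bulk key loading.
# Iteration counts below are example values, not any client's defaults.
import hashlib
import time

PASSWORD = b"correct horse battery staple"
SALT = bytes(32)  # fixed salt, fine for a timing illustration

def time_to_open(iterations: int, num_keystores: int = 1000) -> float:
    """Estimate seconds to derive decryption keys for `num_keystores` keystores."""
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", PASSWORD, SALT, iterations, dklen=32)
    return (time.perf_counter() - start) * num_keystores

for c in (262_144, 4_096):  # high work factor vs. a deliberately lowered one
    print(f"c={c}: ~{time_to_open(c):.1f}s to open 1000 keystores")
```
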
## go-ethereum

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

We had a config issue where Holesky and Sepolia did not have deposit contract addresses set, leading to the Holesky incident. This issue was fixed here: https://github.com/ethereum/go-ethereum/pull/31247

### What client issues have you encountered during the period of long non-finality?

Some clients forced geth nodes to reorg below our snap sync point, which caused geth to crash: https://github.com/ethereum/go-ethereum/issues/31320

### Any issues you have encountered after finalization on Holešky?

Not with geth.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

We errored out on all logs from the deposit contract whose size was not equal to 576 bytes, which caused us to produce empty blocks on Sepolia, since the transactions in the mempool emitted events with a log size of 32. This was fixed here: https://github.com/ethereum/go-ethereum/pull/31317

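To make the failure mode concrete, here is a hedged sketch (Python; not geth's code nor the exact logic of the fix) of the behaviour described above: only logs matching the DepositEvent signature and its 576-byte data layout are treated as deposits, while other events emitted from the deposit contract address (such as the 32-byte ones mentioned above) are skipped rather than failing block production:

```python
# A sketch of tolerant deposit-log parsing (not geth's actual implementation).
from eth_utils import keccak

DEPOSIT_EVENT_TOPIC = keccak(text="DepositEvent(bytes,bytes,bytes,bytes,bytes)")
DEPOSIT_EVENT_DATA_LEN = 576  # ABI-encoded pubkey, withdrawal credentials, amount, signature, index

def extract_deposits(logs, deposit_contract):
    """Yield DepositEvent data blobs; ignore unrelated logs instead of erroring."""
    for log in logs:
        if log["address"] != deposit_contract:
            continue
        if not log["topics"] or log["topics"][0] != DEPOSIT_EVENT_TOPIC:
            continue  # some other event emitted by the same contract, e.g. 32-byte data
        if len(log["data"]) != DEPOSIT_EVENT_DATA_LEN:
            raise ValueError("malformed DepositEvent")  # strict only for real DepositEvents
        yield log["data"]
```
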
We also noticed that we did not have an override flag for Cancun yet, fixed here: https://github.com/ethereum/go-ethereum/pull/31341

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

We have `--synctarget` to force our client onto a certain chain. This has to be specified on startup though, so it would be good to be able to add it later as well; see https://github.com/ethereum/go-ethereum/issues/31375

### Any more thoughts you'd like to share in the general document?

(Personal opinion of me, Marius.) I think the incident response of the EL client teams was pretty good in both cases. I feel like there is a lot of coordination and cooperation missing on the CL side: CL teams all worked in their own silos without notifying others about the issues and fixes they were working on. Also, big props to EthPandaOps for being online non-stop two days in a row.

## Grandine

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

The main issue was that we never tested Grandine under such long non-finality. We should be much better prepared next time.

### What client issues have you encountered during the period of long non-finality?

We mainly faced increased memory usage that often led to OOM.

### Any issues you have encountered after finalization on Holešky?

No.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

We do not run Sepolia validators.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

We need regular non-finality networks. My idea would be to use Holesky, or a network like it, as the new large non-finality network that goes into non-finality regularly and does not finalize for weeks.

### Any more thoughts you'd like to share in the general document?

My key item would be long non-finality networks as a standard testing procedure.

## Teku

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

By far our biggest challenge was to spin up a new node and re-sync the chain during the non-finality period. We ended up identifying a few things to improve in our syncing strategy that made it better. We also created a "superbeacon" node with heaps of CPU/RAM to be able to handle longer periods of non-finality. This ended up being a good strategy and, once we managed to sync the node, it was mostly carrying the load of all our VCs (100k keys).

The decision to change our infrastructure to separate BN from VC helped us greatly during non-finality, because when a node was unable to sync we could simply change where the VC was attempting to perform its duties. Other teams providing BN access for short-term use was also very helpful.

### What client issues have you encountered during the period of long non-finality?

- Fork choice bug (probably caused by wrong handling of equivocating votes)
- Sync issues during long non-finality: when a node restarts, protoArray initialises with 0 weights, so the canonical head becomes a random chain tip in the past, causing the sync process to restart from an old block (fixed)
- Node too easily decides to restart syncing from the last finalized state (fixed)
- Slow block production due to too many single attestations to deal with (fixed)
- Several noisy log messages cleaned up (they made things harder to diagnose)
- Bad attestation selection for aggregation, fixed by adding a round of sorting (can be improved further)

### Any issues you have encountered after finalization on Holešky?

No

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

Not a lot of involvement in the Sepolia incident given it was mostly handled by EL teams.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

- We are re-thinking our deployment infrastructure for production-like environments to make them more resilient to incidents and downtime. We have identified that we should treat Holesky/Sepolia/Hoodi as a staging environment and make sure we have an incident response plan for handling incidents.
- We considered some functionality to make Teku sync from a non-finalized state, as other CLs have done, but it is not the highest priority at the moment since our codebase heavily relies on a finalized source to start.

Maybe it is worth a broader discussion around the requirements and expectations for each testnet. Lucas has written a piece with some reflections on it: https://hackmd.io/@lucassaldanha/rJd-9rAikg

## Lighthouse

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

Michael promptly shut down our validators, avoiding attestation to the invalid chain.

We added a `lighthouse/add_peer` endpoint to help nodes find canonical chain peers, especially useful with `--disable-discovery`.

Beacon state bloat occurred due to inactivity penalties affecting the `inactivity_scores`, `validators`, and `balances` fields. This caused large epoch boundary state diffs.

State cache improvements included:

- Protecting head block states from pruning
- Smarter caching of boundary states
- Prioritized pruning strategy
- Added control flag
- Fixed hot-to-cold DB migration issues
- Reduced default cache size (256 → 32)

We implemented "pseudo finalization" to improve disk efficiency and help sync reject incorrect chains. `BlocksByRange` was optimized to load from fork choice when possible. The new `--invalid-block-roots` flag allowed automatic invalidation of problematic blocks (like 2db899...).

### What client issues have you encountered during the period of long non-finality?

#### Syncing Issues

During the Holesky incident, syncing was difficult as most peers served chains descending from a bad block, making it hard for Lighthouse to find peers following the canonical chain.

#### Disk Space Issues

Lighthouse stores data in two databases: a disk-efficient cold DB (finalized data) and a hot DB (non-finalized data, stored without diff algorithms). Two major problems emerged:

- Without finalization, data couldn't migrate from the hot DB to the cold DB
- Beacon states ballooned to ~180MB each, stored directly in the hot DB

Normal disk usage of ~60GB exploded to over 1TB in some instances.

#### Memory Issues

The default LRU beacon state cache holds 256 states, each ~180MB during the incident, so nodes frequently ran out of memory. `BlocksByRange` requests loading and writing states to the cache worsened the OOM issues.

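A rough back-of-the-envelope calculation using only the figures quoted above (both are approximations, so the totals are rough) shows why the default cache capacity alone was enough to OOM a node, and why the default was later reduced from 256 to 32 entries:

```python
# Worst-case memory held by the LRU beacon state cache, using the figures above.
STATE_SIZE_MB = 180  # approximate beacon state size during the incident

for cache_entries in (256, 32):  # default cache size before and after the reduction
    total_gb = cache_entries * STATE_SIZE_MB / 1024
    print(f"{cache_entries} cached states ≈ {total_gb:.1f} GB")
# 256 cached states ≈ 45.0 GB, more RAM than most nodes have
# 32 cached states  ≈ 5.6 GB
```
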
#### State Cache Misses

During non-finality with multiple side chains, the cache became flooded with unhelpful states, further degrading performance.

### Any issues you have encountered after finalization on Holešky?

Pseudo-finalization had an effect on block validation that we missed, which was revealed after Holesky finalized. We have a "split block", which is the block that marks the split between freezer database and hot database storage. The split block had always been equal to or behind the finalized block, but with manual finalization the split block can be ahead of the finalized block. This broke a block validation check we had that relied on the split block always being older than or equal to the finalized block.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

No

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

#### Hot tree states

Our team has been working on a feature called "hot tree-states". In a nutshell, it will allow us to store data in our hot DB in a disk-efficient manner; we already have this feature in the cold DB. In periods of non-finality, when Lighthouse can't migrate data from the hot DB to the cold DB and beacon states take up significant disk space, hot tree-states will prevent Lighthouse from consuming an inordinate amount of disk space.

### Any more thoughts you'd like to share in the general document?

Small stakers with minimal hardware setups will likely drop out or struggle to stay online. We could have their VCs point to a healthy BN, similar to what the rescue nodes are doing. This would help keep more validators online, aiming to achieve finalization.

We could have a mechanism similar to checkpoint sync that is ongoing: essentially a leader telling you the chain to follow at unfinalized checkpoints, so invalid chains can be pruned.

Ethereum prioritises liveness over safety. In the case of chain splits, blocks will continue to be produced, but on many forks. In practice, however, when end users want to transact they will mostly use whichever chain the major UIs (MetaMask or other wallets) follow, and those UIs could be on an invalid chain, intentionally or unintentionally. This could lead to some controversial outcomes. A non-finalization plan covering this could help mitigate it in the future.

## Lodestar

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

We've documented it and will be publishing a blog post about it. The draft is located here: https://hackmd.io/@philknows/ByxcAAWnye

### What client issues have you encountered during the period of long non-finality?

Covered in: https://hackmd.io/@philknows/ByxcAAWnye

### Any issues you have encountered after finalization on Holešky?

Not a lot of issues after recovering on Holesky. We were running some experiments and accidentally slashed about 2000 of our genesis validators due to some mis-coordination within the team (not related to the network). Otherwise, we were able to sync pretty easily, especially after PandaOps gave us the finalized checkpoint.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

There were no real issues on the Sepolia side. We did have Watchtower running on our Sepolia node, so we accidentally upgraded to geth:latest as soon as the fix was published. This just caused us to miss blocks, as our geth included the fix about 1-2 hours before the coordinated upgrade. Once we downgraded back to the pre-fix version, we proposed blocks again on the network without any issue. We were able to coordinate the upgrade alongside other node operators without issue on the call.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

All documented in: https://hackmd.io/@philknows/ByxcAAWnye

### Any more thoughts you'd like to share in the general document?

We will be publishing this retrospective on the ChainSafe blog for future reference.

## Prysm

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

This PR will serve as our public postmortem: https://github.com/prysmaticlabs/documentation/pull/1028. It has a few things paraphrased or removed to respect confidentiality. There are some lessons learned:

1. Don't delete the EL DBs too early. We deleted the EL DBs too early, which set us back on syncing and on working on hacks to keep the chain alive on the minority correct chain.
2. Ran into memory issues with pod allocation, as well as out-of-space issues.
3. Had to add hacks in Prysm to reject bad blocks manually and prevent resyncs: https://github.com/prysmaticlabs/prysm/compare/develop...hackSync
4. Random restarts in our infrastructure affected our liveness stability strategy and set us back. There doesn't seem to be a good way to avoid this with Kubernetes pods.

We ran into lots of issues that can be described under point 3.

### What client issues have you encountered during the period of long non-finality?

- We had more difficulties syncing because our architecture has difficulty handling periods of 2000+ empty slots. Our legacy range sync code was buggy, because its design assumes large sequences of skipped slots are a sign of a dead-end fork, and it relied on a deprecated protocol feature (the `BeaconBlocksByRange` step parameter) to explore other forks (see the sketch after this list).
- Uncovered an old GetDuties REST API bug: calling the committees endpoint blows up performance. We killed the nodes running the REST API after that. This bug also prevented us and EthPandaOps from pointing validators at beefy, correctly synced Prysm/geth combinations.
- We had an issue with handling attester slashings and a failing gossip validation path (Pull Request #14985).
- Slasher did not work correctly post-Electra, and we also have no way to provide data from the wrong fork to the slasher for proper slashing after the event; long-term solutions are being looked at.

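For the protocol detail in the first bullet, here is a minimal sketch (Python, simplified field layout; not Prysm code) of a `BeaconBlocksByRange` request: the `step` field is deprecated in the consensus p2p spec and peers expect it to be 1, so it can no longer be used to sample distant slots and probe alternative forks, and syncing instead pages through slots in plain batches:

```python
# Simplified BeaconBlocksByRange request (field layout only, not Prysm code).
from dataclasses import dataclass

@dataclass
class BeaconBlocksByRangeRequest:
    start_slot: int
    count: int
    step: int = 1  # deprecated in the p2p spec; peers expect 1

def batched_requests(from_slot: int, to_slot: int, batch_size: int = 64):
    """Page through a slot range in fixed-size batches (the post-deprecation pattern)."""
    slot = from_slot
    while slot <= to_slot:
        yield BeaconBlocksByRangeRequest(start_slot=slot,
                                         count=min(batch_size, to_slot - slot + 1))
        slot += batch_size
```
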
We have other issues that we can't list due to the character limit.

### Any issues you have encountered after finalization on Holešky?

We saw `could not process attestation: bitfield length 856 is not equal to committee length 855","message":"Could not process attestation for fork choice…`, which means we did not completely solve all our attestation issues.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

We also saw `{"error":"rpc error: code = Internal desc = handle block failed: submit blinded block failed: error posting the blinded block to the builder api: unsupported error code: 502: did not receive 200 response from API","message":"Failed to propose block","prefix":"client","pubkey":"0xa019370ca799","severity":"ERROR","slot":7149132}` on our only builder-enabled Sepolia node. This might need more investigation.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

There's still a lot to do and discuss. We've attempted to add new features, add fixes, and open issues to track features that could help us recover or debug more quickly in a future event.

Added a sync-from-head feature and flag: https://github.com/prysmaticlabs/prysm/pull/15000, with bug fix https://github.com/prysmaticlabs/prysm/pull/15006.

Added several new issues to plan for future features:

- https://github.com/prysmaticlabs/prysm/issues/14988
- https://github.com/prysmaticlabs/prysm/issues/14989
- https://github.com/prysmaticlabs/prysm/issues/14987
- https://github.com/prysmaticlabs/prysm/issues/14986
- https://github.com/prysmaticlabs/prysm/issues/14994

Fixes and proposals for attestations since the event, with possibly more to come:

- https://github.com/prysmaticlabs/prysm/pull/15027
- https://github.com/prysmaticlabs/prysm/pull/15028
- https://github.com/prysmaticlabs/prysm/pull/15018
- https://github.com/prysmaticlabs/prysm/pull/15034

Also #15024 and #14990.

### Any more thoughts you'd like to share in the general document?

I wish I could have pasted more in this document.

## Nimbus

### Any notable highlights from your team on the Holešky incident? Please share if you kept any notes or postmortem

For this purpose, we will be focusing on the Nimbus CL and its interactions with the various ELs.

### What client issues have you encountered during the period of long non-finality?

For Nimbus, non-finality itself on Holesky was largely a non-issue. Neither memory usage nor disk space usage suffered. The sync performance of the default branches was not optimal, and we switched to `feat/splitview` to better track forks.

Also, Nimbus chose not to manually whitelist or blacklist any blocks, in either `feat/splitview` or the default branches, and this ran into a couple of issues:

1) Because of the sequence of (a) newPayload returning `SYNCING`, (b) Nimbus adding the block to its fork choice, and (c) fork choice later determining that the block is actually `INVALID`, the block is by then already in fork choice and justified. This creates a scenario that is tricky to recover from, due to the fundamentally optimistic nature of how the engine API works (sketched below).
2) Similarly, while the `feat/splitview` branch was able to effectively find and explore lots of forks from different nodes, it was often unable to get ELs to respond with anything but `SYNCING`, so it couldn't rule out actually-`INVALID` forks.

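To make the ordering in (1) concrete, here is a deliberately toy sketch (Python; not Nimbus code, and it ignores the real fork choice and justification rules) of why an `INVALID` verdict that only arrives after optimistic import and justification is hard to act on:

```python
# Toy model of optimistic block import (heavily simplified, not Nimbus code).
# The point: by the time the EL upgrades its verdict from SYNCING to INVALID,
# the block may already be justified, and simply deleting it is no longer safe.
SYNCING, VALID, INVALID = "SYNCING", "VALID", "INVALID"

class ForkChoice:
    def __init__(self):
        self.blocks = {}        # block_root -> last known payload status
        self.justified = set()  # roots this node considers justified

    def on_block(self, root, new_payload_status):
        # The engine API is optimistic: SYNCING still lets the block into fork choice.
        if new_payload_status in (VALID, SYNCING):
            self.blocks[root] = new_payload_status

    def on_late_verdict(self, root, status):
        self.blocks[root] = status
        if status == INVALID and root in self.justified:
            # The case described in (1): the block is already woven into the
            # justified chain, so a simple prune is no longer possible.
            return "manual recovery required"
        return "pruned" if status == INVALID else "ok"

fc = ForkChoice()
fc.on_block("0xbad", SYNCING)      # (a) + (b): imported optimistically
fc.justified.add("0xbad")          # attestations accumulate in the meantime
print(fc.on_late_verdict("0xbad", INVALID))  # (c): manual recovery required
```
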
### Any issues you have encountered after finalization on Holešky?

After Holesky finalized, the Nimbus CL took a while to finish some on-finalization processing, which disrupted slot and block processing for a while. Once it got past that, it was fine, and `feat/splitview` wasn't useful or necessary anymore.

### Any notable highlights from your team on the Sepolia incident? Please share if you kept any notes or postmortem

For the Nimbus CL, Sepolia was largely a non-event. We didn't have to release any new versions for it, and it was an essentially EL-oriented event. The Nimbus EL incorporated the necessary fixes.

### Apart from what has been discussed in interop already, has your team identified any specific measures to prevent similar issues in the future?

Better configuration testing and improving, for example, checkpoint recovery have been discussed; less discussed has been non-finality testing. Non-finality has happened only accidentally (Goerli and now Holesky), and revealed serious problems with both CLs and ELs both times. This could be done more regularly.