# Scenario tests before Shapella mainnet

## Behavior that needs testing:

1. Clients build locally if the relay doesn't work
2. Clients build locally if their circuit breaker hits (X blocks missing in a row or some period of epochs)

## Tests

1. Testnet with 600k validators, 300k BLS changes lined up, and 100k exits lined up - Worst case testing
2. Relay/Builders testing implementation on zhejiang: Happy case
3. Have mock/real Builders/Relays on a shadow fork + mev-boost running on 80% of vals. Take the relay offline and test failover (Tests behavior 1. on shadow fork)
4. Non-finality during full deposit/withdrawal queue
5. Non-finality post-shapella with full deposit/withdrawal queue AND mev-boost active on >80%. Test the circuit breaker. (Tests behavior 2.)

## Testnet Plan

| Testnet | Week of | Test it addresses | Notes | Status |
|-----------------------------| ------------ | ----------------- |-------|-------|
| `withdrawal-devnet-7` | 13th Feb | 1. | handled by Barnabas | Done |
| `zhejiang` | 20th Feb | 2. | handled by external parties | Done |
| `withdrawal-msf-2` | 20th Feb | 3. | handled by Pari | Done |
| `withdrawal-devnet-8` | 20th Feb | 4., 5. | handled by ? | inadvertently tested on Goerli |

## Results

### withdrawal-devnet-7

withdrawal-devnet-7 highlighted some potential issues with some clients that we might experience on mainnet too:

* Very high CPU and RAM usage on fork transition. As a bunch of BLS messages are gossiped around the network, some nodes simply crash due to insufficient RAM. We were testing the worst-case scenario with 120 validating nodes, each running 5k validators. They were required to process a huge influx of BLS change messages (it is safe to assume quite a few of them were lost); we will see in a few days. We submitted 360k BLS change messages, causing overflow issues for teku and lodestar.
* The overall stress of hundreds of thousands of BLS changes on mainnet should have a much smaller overall effect than on devnet 7.
  The load will spread among more nodes.
* We recovered to previous attestation rates within 3 epochs. Epoch 450 had an attestation rate of 62% and the chain wasn't finalizing for 2 epochs thereafter.
* besu-prysm pairs discovered a deposit issue. Due to the stress test, continuous deposits and exits are processed side by side with BLS changes. Even before shapella was triggered, prysm/besu pairs had issues proposing blocks. To fix the issue, the `--rpc-http-max-batch-size=1000` flag has to be set on besu. Besu devs are looking into how to fix this for mainnet. We hadn't caught this issue before, as we hadn't been doing devnets with deposits, and the zhejiang testnet is too small to notice these issues. The trigger scenario is when the deposit pool is full.
* Besu has since released a fix, which has been deployed on devnet 7 and the zhejiang testnet.

### withdrawal-mainnet-shadowfork-2

#### Experiment 1:

We had to update the capella fork epoch on all nodes, which was the perfect opportunity to check that the circuit breaker relating to missed slots gets activated. So we cycled through all the client pairs and took them down for the upgrade, causing a consistent stream of missed slots as the nodes restarted.

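For readers unfamiliar with the mechanism, the kind of condition a missed-slot circuit breaker evaluates can be sketched roughly as below. This is a minimal Python illustration, not any client's implementation: the names `BreakerConfig` and `should_build_locally`, the reason strings, and the per-epoch threshold of 8 are assumptions made for this sketch. The consecutive threshold of 3 merely echoes the `maxConsecutiveSkipSlotsAllowed=3` value visible in the Prysm log further down, and the two conditions mirror "X blocks missing in a row or some period of epochs" from the behavior list.

```python
from dataclasses import dataclass

SLOTS_PER_EPOCH = 32  # mainnet consensus-layer constant


@dataclass(frozen=True)
class BreakerConfig:
    # Illustrative thresholds (assumptions), not taken from any client's defaults.
    max_consecutive_missed: int = 3
    max_missed_per_epoch: int = 8


def should_build_locally(current_slot, seen_slots, cfg=BreakerConfig()):
    """Return (tripped, reason) for a proposer about to propose at current_slot.

    seen_slots is the set of slot numbers for which a block was received.
    """
    # Condition 1: length of the run of empty slots right before current_slot,
    # capped at the threshold since longer runs do not change the outcome.
    consecutive = 0
    slot = current_slot - 1
    while (
        slot >= 0
        and slot not in seen_slots
        and consecutive < cfg.max_consecutive_missed
    ):
        consecutive += 1
        slot -= 1
    if consecutive >= cfg.max_consecutive_missed:
        return True, "consecutive_missed_slots"

    # Condition 2: total empty slots within the last SLOTS_PER_EPOCH slots.
    window_start = max(0, current_slot - SLOTS_PER_EPOCH)
    missed = sum(
        1 for s in range(window_start, current_slot) if s not in seen_slots
    )
    if missed >= cfg.max_missed_per_epoch:
        return True, "missed_slots_per_epoch"

    return False, "healthy"
```

With blocks seen up to slot 70783 and a proposal due at slot 70787, the three empty slots in between trip the first condition in this sketch. Real clients differ in how they count skipped slots and in which window they use, so the thresholds here should not be read as anyone's defaults.
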
Test: If there are missed slots and the network is unhealthy, the circuit breaker will be triggered

Observation:
- Before the upgrade, we noticed healthy use of the relay:

Teku:
```
2023-03-03 11:27:48.483 INFO - Received Builder Bid (Block Number = 16746226, Block Hash = 0xb3610f9322096b74ce4debec407c859639e5172e4e75cc5eed6b8d79f72d2dcf, MEV Reward (wei) = 26161838280626996, Gas Limit = 30000000, Gas Used = 3160291)
2023-03-03 11:27:48.752 INFO - Received execution payload from Builder (Block Number 16746226, Block Hash = 0xb3610f9322096b74ce4debec407c859639e5172e4e75cc5eed6b8d79f72d2dcf)
2023-03-03 11:27:48.822 INFO - Validator *** Published block Count: 1, Slot: 70634, Root: 7be85d0096343694ae738a098215f42fe7eee228281df346eaab443528c5c584, 3160291 (10%) gas, EL block: b3610f9322096b74ce4debec407c859639e5172e4e75cc5eed6b8d79f72d2dcf (16746226)
```

Lighthouse:
```
Mar 03 11:20:00.547 INFO Received local and builder payloads parent_hash: 0xa53bd0ec743a60fd9ee80e7769f72e79dd2783278956de07ac00aed25d3920e6, local_block_hash: 0x9bfdfe52958342f9743280fd426dad0d2cf96e0b8b691ef4c7b2633956671c56, relay_block_hash: 0x13799b85d7e3a068a7080f0cfb2ed2467545aed8af357e65da0765880f6062ff, service: exec
Mar 03 11:20:00.560 DEBG Sending block to builder root: 0xf4358d4d08ab1f9118c907b64856715db722f287c9e9bbf3890e153d179da15a, service: exec
Mar 03 11:20:00.848 INFO Successfully published a block to the builder network, block_hash: 0x13799b85d7e3a068a7080f0cfb2ed2467545aed8af357e65da0765880f6062ff
```

- During the upgrade we saw missed slots, and our teku & lighthouse nodes reported this in the logs:

Teku:
```
2023-03-03 11:30:00.018 INFO - Falling back to locally produced execution payload (Block Number 16746234, Block Hash = 0x637d8b24daed3c7812b9876ebc19d15288327cced3be7697327e1e7132fb40e0, Fallback Reason = circuit_breaker_engaged)
2023-03-03 11:30:00.086 INFO - Validator *** Published block Count: 1, Slot: 70645, Root: 285d5bdba1f8984f37c21ab44544074b78663f3fe86964328
```

Lighthouse:
```
Mar 03 11:25:48.009 INFO Forwarding register validator request to connected builder, count: 100
Mar 03 11:30:48.009 INFO Chain is unhealthy, using local payload, failed_condition: SkipsPerEpoch, info: this helps protect the network. the --builder-fallback flags can adjust the expected health conditions., service: exec
Mar 03 11:32:12.010 INFO Forwarding register validator request to connected builder, count: 100
```

Prysm:
```
time="2023-03-03 11:58:24" level=warning msg="Circuit breaker activated due to missing consecutive slot. Ignore if mev-boost is not used" currentSlot=70787 highestReceivedSlot=70783 maxConsecutiveSkipSlotsAllowed=3 prefix="rpc/validator"
```

- Once all the nodes were done upgrading and there were fewer missed slots, we saw nodes rely on the builder again:

Teku:
```
2023-03-03 11:34:48.475 INFO - Received Builder Bid (Block Number = 16746256, Block Hash = 0x2603f51bb0d5ad28a5a3a9faec65dc6dd3590b3799c1b29ebfdc46df7ce30da1, MEV Reward (wei) = 26718394013555870, Gas Limit = 30000000, Gas Used = 3752109)
2023-03-03 11:34:48.659 INFO - Received execution payload from Builder (Block Number 16746256, Block Hash = 0x2603f51bb0d5ad28a5a3a9faec65dc6dd3590b3799c1b29ebfdc46df7ce30da1)
2023-03-03 11:34:48.727 INFO - Validator *** Published block Count: 1, Slot: 70669, Root: fc4f8231bc3b4db69256288f402d24d42c9baaf719b29143464bd15c6c3fb5f9, 3752109 (12%) gas, EL block: 2603f51bb0d5ad28a5a3a9faec65dc6dd3590b3799c1b29ebfdc46df7ce30da1 (16746256)
```

Lighthouse:
```
Mar 03 11:42:12.637 DEBG Sending block to builder root: 0x2ebc0bde019f940d4af43deaf89bb08a3b61f0f813fcf424eb03f0d51140fc97, service: exec
Mar 03 11:42:12.848 INFO Successfully published a block to the builder network, block_hash: 0x9a577780b88f5e01a8182b3d1b7a7ed1f204d3ce9361d24dbdd53ac8b87a55de
Mar 03 11:45:00.008 INFO Forwarding register validator request to connected builder, count: 100
```

#### Experiment 2:

Test: If the relay is offline, the validators will fail over to local production

Observation:
- Before the upgrade, we noticed healthy use of the relay:
```
2023-03-03 11:15:12.550 INFO - Received Builder Bid (Block Number = 16746178, Block Hash = 0xeb562982a1d8726a8434d041b0738a25f9d5013a285b6f7702af13c14ceb8a57, MEV Reward (wei) = 12020124267539638, Gas Limit = 30000000, Gas Used = 1898132)
2023-03-03 11:15:12.839 INFO - Received execution payload from Builder (Block Number 16746178, Block Hash = 0xeb562982a1d8726a8434d041b0738a25f9d5013a285b6f7702af13c14ceb8a57)
2023-03-03 11:15:12.901 INFO - Validator *** Published block Count: 1, Slot: 70571, Root: 13ce98f12b2e64478e394438c6c0f0b8702525dd095a4849c0a196b4fdfe53e0, 1898132 (6%) gas, EL block: eb562982a1d8726a8434d041b0738a25f9d5013a285b6f7702af13c14ceb8a57 (16746178)
```

- The relay was then turned off, leading to the logs below, which indicate that local block production is working:

Teku:
```
2023-03-03 13:45:36.023 INFO - Falling back to locally produced execution payload (Block Number 16746809, Block Hash = 0x96c8f25b7440836c336254dc7debeeb0993a506313aa226f493e211d6544475a, Fallback Reason = builder_not_available)
2023-03-03 13:45:36.066 INFO - Validator *** Published block Count: 1, Slot: 71323, Root: 1a48d534d19893fcd566d3ff88ac844af7be7645b596a235e8b62fafd994c61a, 5552215 (18%) gas, EL block: 96c8f25b7440836c336254dc7debeeb0993a506313aa226f493e211d6544475a (16746809)
2023-03-03 13:45:36.086 WARN - The builder is not available: java.net.ConnectException: Failed to connect to /159.223.215.206:18550. Block production will fallback to the execution engine.
```

Prysm:
```
time="2023-03-03 13:48:15" level=error msg="Failed to call relayer status endpoint, perhaps mev-boost or relayers are down" error="Get "http://159.223.215.206:18550/eth/v1/builder/status": dial tcp 159.223.215.206:18550: connect: connection refused"
```

- Because local building works as expected, a healthy number of blocks is still being produced on the network

#### Experiment 3:

Test: If the relay is delivering invalid information, the validators will fail over to local production

Observation:
- Before the upgrade, we noticed healthy use of the relay
- The mock relay was configured to deliver invalid `parent_hash`s. This led to the Teku CL verifying this information and building locally immediately:
```
Caused by: tech.pegasys.teku.spec.logic.common.statetransition.exceptions.BlockProcessingException: Execution payload parent hash does not match previous execution payload header
2023-03-03 14:19:36.518 INFO - Falling back to locally produced execution payload (Block Number 16746963, Block Hash = 0x2cfaa6a06a4fde8b9cc4b3f1733048a13ee467c8b32d2e2c8718e7e29e06d4cd, Fallback Reason = builder_error)
```
- This validation, however, isn't done for every field, so we tried an invalid `base_fee` instead:
```
2023-03-03 14:38:24.748 WARN - Payload for child of block root 0x13678231e8b70ef55818bcaaf6354ecd778143837dc8bafb10dc24306748f151 marked as invalid by Execution Client
```
- This led to a lot of missed slots, which then triggered the `circuit_breaker` conditions:
```
2023-03-03 14:57:00.034 INFO - Falling back to locally produced execution payload (Block Number 16747101, Block Hash = 0x599f18e34713a7de1ac3d6c85c0250fca2778ad7879a87ad12835bdebddedecf, Fallback Reason = circuit_breaker_engaged)
2023-03-03 14:57:00.103 INFO - Validator *** Published block Count: 1, Slot: 71680, Root: 885179340656d376c9f209b297c672f19390abe423679d8961a6c6afd51e35c3, 2071465 (6%) gas, EL block: 599f18e34713a7de1ac3d6c85c0250fca2778ad7879a87ad12835bdebddedecf (16747101)
```
- This allowed local block building to take over and let the network stay healthy
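
When reviewing runs like these, it can help to tally why proposals fell back to local building. The following is a minimal Python sketch, assuming Teku log lines shaped like the excerpts above. The function name `tally_fallback_reasons` is hypothetical, and the regular expression only targets the `Fallback Reason = ...` text seen in this document, so it would need adjusting for other clients or log formats.

```python
import re
from collections import Counter

# Matches the "Fallback Reason = <reason>)" suffix seen in the Teku excerpts.
FALLBACK_RE = re.compile(r"Fallback Reason = (\w+)\)")


def tally_fallback_reasons(lines):
    """Count Teku local-fallback log lines by their stated reason."""
    counts = Counter()
    for line in lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it an open log file (any iterable of lines) returns a `Counter` keyed by reason, for example `circuit_breaker_engaged`, `builder_not_available` or `builder_error`, which gives a quick view of which of the three experiments' failure modes dominated a given window.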