# Caplin experiment

## Intro

The goal of this experiment was to run Caplin as beacon node software. Additionally, we wanted to run some Holesky validators using Caplin. Caplin itself doesn't have a validator client, so we needed to use one from another client. Erigon reported back that they have experimented with the Lodestar and Lighthouse validator clients.

Initially we tried to run Caplin standalone, but there currently seems to be no way to do that. Caplin is built into Erigon 3, which is currently still in alpha. So, to test Caplin, we also had to test Erigon 3.

### New syncing mechanism

Erigon 3 comes with OtterSync, which is their solution to quickly sync the node using torrent/webseeds. The codebase for it can be seen on [erigontech/erigon-snapshot](https://github.com/erigontech/erigon-snapshot/tree/main).

- Mainnet R2 bucket list: https://github.com/erigontech/erigon-snapshot/blob/main/webseed/mainnet.toml
- Mainnet files list: https://github.com/erigontech/erigon-snapshot/blob/main/mainnet.toml
- Example mainnet file: https://erigon3-v3-snapshots-mainnet.erigon.network/v2/domain/v1-accounts.0-1024.kv

Example of OtterSync downloading Mainnet data from their webseed:

```sh
[INFO] [08-12|14:29:14.461] [1/6 OtterSync] download progress="99.90% 889.6GB/889.6GB" time-left=999hrs:99m total-time=2h4m40s download=128.5MB/s flush=185.5MB/s hash=151.5MB/s complete=0B/s upload=0B/s peers=0 files=317 metadata=317/317 connections=0 alloc=4.4GB sys=17.2GB
[INFO] [08-12|14:29:34.460] [1/6 OtterSync] download finished time=2h5m0.000125673s
```

![](https://storage.googleapis.com/ethereum-hackmd/upload_840885ea338f10240721d0f21e5797a0.png)

## Configuration

The following config was successfully used to sync a Mainnet and a Holesky node. There were some bugs where we had to wipe the datadir or restart the node from time to time so that it could keep up with syncing. Most of them seemed to be related to race conditions in the erigon/caplin code.
**Version:** [`v3.0.0-alpha2`](https://github.com/erigontech/erigon/releases/tag/v3.0.0-alpha2) (Mainnet) + [main-a264b9f](https://github.com/erigontech/erigon/tree/a264b9f86752695fa5e0a0f89e53f03ceeab6185) (Holesky).

**Servers:**

- 1x Holesky: DigitalOcean Memory Optimized
  - 32GB - 4 CPU cores
  - 1TB Network attached volume block storage
- 1x Mainnet: DigitalOcean Memory Optimized
  - ~~32GB - 4 cores~~ 64GB - 8 CPU cores
  - 5TB Network attached volume block storage

Note: We had to bump the size for Mainnet due to OOM issues.

**CLI Flags:**

The following example is for `mainnet`. To target other public testnets, we just need to change the `--chain` flag, e.g. `--chain=holesky`.

```shell
# Flags are collected in a bash array so that they can be commented inline.
ERIGON_ARGS=(
  --datadir='/data'
  --nat="extip:$IP_ADDR_PUBLIC"
  --port=30303
  --http
  --http.addr='0.0.0.0'
  --http.port=8545
  --authrpc.jwtsecret='/execution-auth.jwt'
  --authrpc.addr='0.0.0.0'
  --authrpc.port=8551
  --authrpc.vhosts='*'
  --metrics
  --metrics.addr='0.0.0.0'
  --metrics.port=6060
  --http.api='eth,erigon,engine,web3,net,debug,trace,txpool,admin'
  --http.vhosts='*'
  --chain='mainnet'
  # Full node, instead of archive
  --prune.mode='full'
  ## Caplin / Beacon specific flags
  --caplin.discovery.addr='0.0.0.0'
  --caplin.discovery.port=9000
  --caplin.discovery.tcpport=9000
  --caplin.backfilling
  --beacon.api='beacon,config,debug,events,node,validator,lighthouse'
  --beacon.api.addr='0.0.0.0'
  --beacon.api.port=5052
  --beacon.api.cors.allow-origins='*'
)
erigon "${ERIGON_ARGS[@]}"
```

### Sync test runs

There were many test runs with different configurations/versions trying to get the node to work.
Here we're just listing the most recent successful runs:

- **Mainnet**:
  - Node start: 12.08.2024 12:15 UTC
  - Node synced: 13.08.2024 11:30 UTC
  - Total sync time: 23h 15min
  - Disk used: 1002 GB
  - Graphs/Metrics on [Grafana](https://grafana.observability.ethpandaops.io/d/cas2BrpaEr7k/ethereum-metrics-exporter-overview?orgId=1&var-filter=ingress_user%7C%3D%7Cmainnet&var-filter=consensus_client%7C%3D%7Ccaplin&var-filter=execution_client%7C%3D%7Cerigon&from=1723465800000&to=1723555800000)
- **Holesky**:
  - Node start: 14.08.2024 18:48 UTC
  - Node synced: 15.08.2024 11:23 UTC
  - Total sync time: 16h 35min
  - Disk used: 84 GB
  - Graphs/Metrics on [Grafana](https://grafana.observability.ethpandaops.io/d/cas2BrpaEr7k/ethereum-metrics-exporter-overview?orgId=1&var-filter=ingress_user%7C%3D%7Cholesky&var-filter=consensus_client%7C%3D%7Ccaplin&var-filter=execution_client%7C%3D%7Cerigon&from=1723718628455&to=1723721590000)

### Validator test run on Holesky

- **16.08.2024 at 13:20 UTC** - After observing that Caplin stayed in sync, we attached 10,000 validators to the Caplin beacon node using the Lighthouse validator client. Initial observations showed the validators successfully publishing attestations and sync committee messages.
- **16.08.2024 at 13:29 UTC** - The beacon node got killed due to OOM. The machine had 32GB of RAM.
- **16.08.2024 at 13:45 UTC** - Erigon/Caplin started again, now with 64GB of RAM.
- **16.08.2024 at 13:59 UTC** - Node is synced again to the tip of the chain and is executing its validator duties.
- **16.08.2024 at 14:21 UTC** - Erigon/Caplin dies again due to OOM. This was reported back to Erigon. We gave them access to the server so that they could debug the problem further.

![](https://storage.googleapis.com/ethereum-hackmd/upload_cd089f7c13c94f7d796d26c74ef112c5.png)
Fig 1. Memory spike happening after attesting for around 5min

![](https://storage.googleapis.com/ethereum-hackmd/upload_f5ae247e72f327f6cee690a910286de7.png)
Fig 2.
Output for `go tool pprof -inuse_space -png http://127.0.0.1:6060/debug/pprof/heap > mem.png`

- **19.08.2024 at 11:01 UTC** - Deployed a new version, `bugfix/massive_vc_memory_exhausted`.
- **19.08.2024 at 11:22 UTC** - Validators publishing attestations again.
- **19.08.2024 at 11:31 UTC** - First slot published with Caplin: https://holesky.beaconcha.in/slot/2347059#overview (note that the sync aggregate is 0%, which is unexpected). The next slot was missed: https://holesky.beaconcha.in/slot/2347060
- **19.08.2024 at 11:54 UTC** - Failed to produce slot `2347170`.

## Reported problems:

- No support for custom testnets. Reported and fixed.

---

- Caplin can't run standalone. The problem was acknowledged; the stated reason is mainly the maintenance burden of supporting other clients. Initially they want to focus on making it work well with Erigon only, without the extra burden of guaranteeing that it works across all other EL clients.

---

- When running without `--caplin.backfilling`, OtterSync didn't start, which made the sync process take a long time. The problem was acknowledged and is fixed.

---

- Memory requirements seem a bit high for Mainnet. We couldn't sync Erigon/Caplin on a node with 32GB RAM because the OS kept killing the process. The problem was acknowledged and there is already some work going on to improve this.

```
[Wed Jul 31 23:10:39 2024] Out of memory: Killed process 66075 (erigon) total-vm:19960624940kB, anon-rss:28957016kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:2311640kB oom_score_adj:0
[Fri Aug 2 20:07:55 2024] Out of memory: Killed process 78841 (erigon) total-vm:19967236448kB, anon-rss:30693004kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:597448kB oom_score_adj:0
[Sun Aug 4 03:00:27 2024] Out of memory: Killed process 93140 (erigon) total-vm:19971976628kB, anon-rss:30593408kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:638740kB oom_score_adj:0
[Mon Aug 5 04:54:02 2024] Out of memory: Killed process 104207 (erigon) total-vm:19990912580kB, anon-rss:30608156kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:621160kB oom_score_adj:0
[Tue Aug 6 06:27:42 2024] Out of memory: Killed process 111174 (erigon) total-vm:19983242864kB, anon-rss:30664348kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:588652kB oom_score_adj:0
[Thu Aug 8 11:50:20 2024] Out of memory: Killed process 123106 (erigon) total-vm:20001339904kB, anon-rss:30559600kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:685308kB oom_score_adj:0
[Sat Aug 10 12:46:49 2024] Out of memory: Killed process 141733 (erigon) total-vm:19997411952kB, anon-rss:30621464kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:616048kB oom_score_adj:0
[Sat Aug 10 22:05:05 2024] Out of memory: Killed process 163328 (erigon) total-vm:20009499480kB, anon-rss:30776268kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:639132kB oom_score_adj:0
[Sun Aug 11 02:04:18 2024] Out of memory: Killed process 167606 (erigon) total-vm:20000623540kB, anon-rss:30977540kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:546936kB oom_score_adj:0
[Sun Aug 11 06:18:49 2024] Out of memory: Killed process 184159 (erigon) total-vm:20003823956kB, anon-rss:30945036kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:562864kB oom_score_adj:0
[Sun Aug 11 18:33:36 2024] Out of memory: Killed process 197465 (erigon) total-vm:20002844484kB, anon-rss:30708068kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:599596kB oom_score_adj:0
[Mon Aug 12 00:51:19 2024] Out of memory: Killed process 203400 (erigon) total-vm:20012108372kB, anon-rss:30870032kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:593056kB oom_score_adj:0
[Mon Aug 12 03:07:43 2024] Out of memory: Killed process 205960 (erigon) total-vm:20005441256kB, anon-rss:31505872kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:258048kB oom_score_adj:0
```

---

- Beacon API endpoint not responding correctly, which prevents the ethereum-metrics-exporter from getting the data it needs. The problem was acknowledged.

```sh
curl 172.18.0.3:5052/eth/v2/beacon/blocks/finalized
# Responds with:
# {"code":500,"message":"transactions not found for block 20511822"}
```

---

- Crash when using OtterSync on Holesky:

```
[INFO] [08-12|15:40:21.971] [1/6 OtterSync] Indexing progress="v1-001000-001100-transactions.seg=19%, v1-001100-001200-transactions.seg=45%, v1-001200-001300-transactions.seg=18%" total-indexing-time=8m20s alloc=4.9GB sys=17.0GB
[INFO] [08-12|15:40:41.960] [1/6 OtterSync] Indexing progress="v1-001000-001100-transactions.seg=27%, v1-001100-001200-transactions.seg=50%, v1-001200-001300-transactions.seg=26%" total-indexing-time=8m40s alloc=6.0GB sys=17.0GB
[INFO] [08-12|15:40:49.830] [Antiquary] Processed snapshots progress=85667 target=2099999
[INFO] [08-12|15:41:01.959] [1/6 OtterSync] Indexing progress="v1-001000-001100-transactions.seg=33%, v1-001100-001200-transactions.seg=50%, v1-001200-001300-transactions.seg=37%" total-indexing-time=9m0s alloc=6.8GB sys=17.0GB
[INFO] [08-12|15:41:19.835] [Antiquary] Processed snapshots progress=99566 target=2099999
panic: runtime error: index out of range [3532] with length 3532 [recovered]
    panic: ReadHeader(99999), runtime error: index out of range [3532] with length 3532, [caplin_snapshots.go:598 panic.go:770 panic.go:120 elias_fano.go:176 elias_fano.go:187 index.go:357 caplin_snapshots.go:617 antiquary.go:165 run.go:283 asm_amd64.s:1695]
goroutine 12262 [running]:
github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks.(*CaplinSnapshots).ReadHeader.func1()
    github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks/caplin_snapshots.go:598 +0xe8
panic({0x2825ea0?, 0xc17e548060?})
    runtime/panic.go:770 +0x132
github.com/erigontech/erigon-lib/recsplit/eliasfano32.(*EliasFano).get(0xc072246280, 0x1008aae52db?)
    github.com/erigontech/[email protected]/recsplit/eliasfano32/elias_fano.go:176 +0x36d
github.com/erigontech/erigon-lib/recsplit/eliasfano32.(*EliasFano).Get(...)
    github.com/erigontech/[email protected]/recsplit/eliasfano32/elias_fano.go:187
github.com/erigontech/erigon-lib/recsplit.(*Index).OrdinalLookup(...)
    github.com/erigontech/[email protected]/recsplit/index.go:357
github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks.(*CaplinSnapshots).ReadHeader(0xc001236240, 0x1869f)
    github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks/caplin_snapshots.go:617 +0x217
github.com/erigontech/erigon/cl/antiquary.(*Antiquary).Loop(0xc01624ec40)
    github.com/erigontech/erigon/cl/antiquary/antiquary.go:165 +0x86a
github.com/erigontech/erigon/cmd/caplin/caplin1.RunCaplinPhase1.func4()
    github.com/erigontech/erigon/cmd/caplin/caplin1/run.go:283 +0x2d
created by github.com/erigontech/erigon/cmd/caplin/caplin1.RunCaplinPhase1 in goroutine 182
    github.com/erigontech/erigon/cmd/caplin/caplin1/run.go:282 +0x2009
```

This was followed by the following error and a crash loop:

```
[INFO] [08-13|09:11:23.152] Beacon API started addr=0.0.0.0:5052
[INFO] [08-13|09:11:23.152] [Caplin] starting clstages loop app=caplin
[INFO] [08-13|09:11:23.153] Starting downloading History app=caplin stage=DownloadHistoricalBlocks from=2303072
[INFO] [08-13|09:11:23.420] [Antiquary] Stopping Caplin to process historical indicies from=0 to=2099999
panic: runtime error: index out of range [3532] with length 3532 [recovered]
    panic: ReadHeader(99999), runtime error: index out of range [3532] with length 3532, [caplin_snapshots.go:598 panic.go:770 panic.go:120 elias_fano.go:176 elias_fano.go:187 index.go:357 caplin_snapshots.go:617 antiquary.go:165 run.go:283 asm_amd64.s:1695]
goroutine 16311 [running]:
github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks.(*CaplinSnapshots).ReadHeader.func1()
    github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks/caplin_snapshots.go:598 +0xe8
panic({0x2825ea0?, 0xc1846d7950?})
    runtime/panic.go:770 +0x132
github.com/erigontech/erigon-lib/recsplit/eliasfano32.(*EliasFano).get(0xc0580f45a0, 0x1008aae52db?)
    github.com/erigontech/[email protected]/recsplit/eliasfano32/elias_fano.go:176 +0x36d
github.com/erigontech/erigon-lib/recsplit/eliasfano32.(*EliasFano).Get(...)
    github.com/erigontech/[email protected]/recsplit/eliasfano32/elias_fano.go:187
github.com/erigontech/erigon-lib/recsplit.(*Index).OrdinalLookup(...)
    github.com/erigontech/[email protected]/recsplit/index.go:357
github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks.(*CaplinSnapshots).ReadHeader(0xc00645e1b0, 0x1869f)
    github.com/erigontech/erigon/turbo/snapshotsync/freezeblocks/caplin_snapshots.go:617 +0x217
github.com/erigontech/erigon/cl/antiquary.(*Antiquary).Loop(0xc004cdba40)
    github.com/erigontech/erigon/cl/antiquary/antiquary.go:165 +0x86a
github.com/erigontech/erigon/cmd/caplin/caplin1.RunCaplinPhase1.func4()
    github.com/erigontech/erigon/cmd/caplin/caplin1/run.go:283 +0x2d
created by github.com/erigontech/erigon/cmd/caplin/caplin1.RunCaplinPhase1 in goroutine 3842
    github.com/erigontech/erigon/cmd/caplin/caplin1/run.go:282 +0x2009
```

The problem was acknowledged. As a workaround we had to disable backfilling (the `--caplin.backfilling=true` flag). After disabling that flag, the node stopped crashing and started syncing again. The backfill flag could then be enabled again once the node was synced.

---

- Caplin seemed to enter "ForwardSync" a bit too often, indicating that it's falling behind. The problem was reported.
It was said that this could be a race condition with Erigon syncing at the same time.

```
[INFO] [08-13|08:51:47.029] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726089 distance-from-chain-tip=8m36s estimated-time-remaining=1m47s
[INFO] [08-13|08:52:02.351] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726102 distance-from-chain-tip=6m0s estimated-time-remaining=1m9s
[INFO] [08-13|08:52:22.527] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726114 distance-from-chain-tip=3m36s estimated-time-remaining=45s
[INFO] [08-13|08:53:07.046] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726139 distance-from-chain-tip=-1m24s estimated-time-remaining=0s
[INFO] [08-13|08:53:07.047] [Caplin] Forward Sync app=caplin stage=ForwardSync from=9726048 to=9726263
[INFO] [08-13|08:54:51.851] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726060 distance-from-chain-tip=40m36s estimated-time-remaining=8m27s
[INFO] [08-13|08:55:09.006] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726073 distance-from-chain-tip=38m0s estimated-time-remaining=7m18s
[INFO] [08-13|08:56:06.829] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726089 distance-from-chain-tip=34m48s estimated-time-remaining=5m26s
[INFO] [08-13|08:56:55.475] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726102 distance-from-chain-tip=32m12s estimated-time-remaining=6m11s
[INFO] [08-13|08:57:41.137] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726114 distance-from-chain-tip=29m48s estimated-time-remaining=6m12s
[INFO] [08-13|08:59:06.599] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726126 distance-from-chain-tip=27m24s estimated-time-remaining=5m42s
[INFO] [08-13|09:00:33.504] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726139 distance-from-chain-tip=24m48s estimated-time-remaining=4m46s
[INFO] [08-13|09:01:02.312] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726151 distance-from-chain-tip=22m24s estimated-time-remaining=4m40s
[INFO] [08-13|09:01:17.405] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726164 distance-from-chain-tip=19m48s estimated-time-remaining=3m48s
[INFO] [08-13|09:01:47.216] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726190 distance-from-chain-tip=14m36s estimated-time-remaining=1m24s
[INFO] [08-13|09:02:32.828] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726215 distance-from-chain-tip=9m36s estimated-time-remaining=57s
[INFO] [08-13|09:02:49.363] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726228 distance-from-chain-tip=7m0s estimated-time-remaining=1m20s
[INFO] [08-13|09:03:18.475] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726252 distance-from-chain-tip=2m12s estimated-time-remaining=13s
[INFO] [08-13|09:03:56.465] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726265 distance-from-chain-tip=-24s estimated-time-remaining=0s
[INFO] [08-13|09:03:56.465] [Caplin] Forward Sync app=caplin stage=ForwardSync from=9726176 to=9726317
[INFO] [08-13|09:05:03.104] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726189 distance-from-chain-tip=25m36s estimated-time-remaining=4m55s
[INFO] [08-13|09:06:23.046] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726202 distance-from-chain-tip=23m0s estimated-time-remaining=4m25s
[INFO] [08-13|09:07:25.945] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726215 distance-from-chain-tip=20m24s estimated-time-remaining=3m55s
[INFO] [08-13|09:08:05.908] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726228 distance-from-chain-tip=17m48s estimated-time-remaining=3m25s
[INFO] [08-13|09:08:51.265] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726240 distance-from-chain-tip=15m24s estimated-time-remaining=3m12s
[INFO] [08-13|09:11:21.936] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726252 distance-from-chain-tip=13m0s estimated-time-remaining=2m42s
[INFO] [08-13|09:14:13.089] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726265 distance-from-chain-tip=10m24s estimated-time-remaining=2m0s
[INFO] [08-13|09:14:46.499] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726278 distance-from-chain-tip=7m48s estimated-time-remaining=1m30s
[INFO] [08-13|09:15:01.152] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726291 distance-from-chain-tip=5m12s estimated-time-remaining=1m0s
[INFO] [08-13|09:15:35.830] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726304 distance-from-chain-tip=2m36s estimated-time-remaining=30s
[INFO] [08-13|09:16:38.917] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726313 distance-from-chain-tip=48s estimated-time-remaining=13s
[INFO] [08-13|09:17:08.022] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726323 distance-from-chain-tip=-1m12s estimated-time-remaining=0s
[INFO] [08-13|09:17:08.023] [Caplin] Forward Sync app=caplin stage=ForwardSync from=9726240 to=9726383
[INFO] [08-13|09:19:08.850] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726252 distance-from-chain-tip=26m12s estimated-time-remaining=5m27s
[INFO] [08-13|09:20:51.528] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726265 distance-from-chain-tip=23m36s estimated-time-remaining=4m32s
[INFO] [08-13|09:21:22.492] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726278 distance-from-chain-tip=21m0s estimated-time-remaining=4m2s
[INFO] [08-13|09:22:12.446] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726291 distance-from-chain-tip=18m24s estimated-time-remaining=3m32s
[INFO] [08-13|09:22:50.968] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726304 distance-from-chain-tip=15m48s estimated-time-remaining=3m2s
[INFO] [08-13|09:23:58.688] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726313 distance-from-chain-tip=14m0s estimated-time-remaining=3m53s
[INFO] [08-13|09:25:14.347] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726323 distance-from-chain-tip=12m0s estimated-time-remaining=3m0s
[INFO] [08-13|09:25:45.552] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726336 distance-from-chain-tip=9m24s estimated-time-remaining=1m48s
[INFO] [08-13|09:26:09.302] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726348 distance-from-chain-tip=7m0s estimated-time-remaining=1m27s
[INFO] [08-13|09:26:41.715] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726374 distance-from-chain-tip=1m48s estimated-time-remaining=10s
[INFO] [08-13|09:27:01.087] [Caplin] Forward Sync app=caplin stage=ForwardSync from=9726304 to=9726433
[INFO] [08-13|09:28:15.596] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726313 distance-from-chain-tip=24m0s estimated-time-remaining=6m40s
[INFO] [08-13|09:29:14.088] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726323 distance-from-chain-tip=22m0s estimated-time-remaining=5m30s
[INFO] [08-13|09:30:25.807] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726336 distance-from-chain-tip=19m24s estimated-time-remaining=3m43s
[INFO] [08-13|09:31:12.138] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726348 distance-from-chain-tip=17m0s estimated-time-remaining=3m32s
[INFO] [08-13|09:31:55.882] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726361 distance-from-chain-tip=14m24s estimated-time-remaining=2m46s
[INFO] [08-13|09:33:01.036] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726374 distance-from-chain-tip=11m48s estimated-time-remaining=2m16s
[INFO] [08-13|09:34:00.715] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726386 distance-from-chain-tip=9m24s estimated-time-remaining=1m57s
[INFO] [08-13|09:34:22.007] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726398 distance-from-chain-tip=7m0s estimated-time-remaining=1m27s
[INFO] [08-13|09:34:47.819] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726411 distance-from-chain-tip=4m24s estimated-time-remaining=50s
[INFO] [08-13|09:35:13.806] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726437 distance-from-chain-tip=-48s estimated-time-remaining=0s
[INFO] [08-13|10:50:33.363] [Caplin] Forward Sync app=caplin stage=ForwardSync from=9726368 to=9726850
[INFO] [08-13|10:51:51.972] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726381 distance-from-chain-tip=1h33m48s estimated-time-remaining=18m2s
[INFO] [08-13|10:53:11.172] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726394 distance-from-chain-tip=1h31m12s estimated-time-remaining=17m32s
[INFO] [08-13|10:54:30.937] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9726407 distance-from-chain-tip=1h28m36s estimated-time-remaining=17m2s
```

---

- Problem with Erigon getting stuck at a specific block height. Probably related to the previous race condition. This problem resolved itself after 1h.

![](https://storage.googleapis.com/ethereum-hackmd/upload_38ac7a6d671ac6c18855301470fcb759.png)

---

- Holesky wasn't syncing. OtterSync seemed to be stuck somehow. Logs were showing:

```
[INFO] [08-13|15:06:57.478] [snapshots] no metadata yet files=18 list=v1-002017-002018-bodies.seg,v1-002015-002016-bodies.seg,v1-002000-002010-headers.seg,v1-002016-002017-bodies.seg,v1-002012-002013-bodies.seg,...
[INFO] [08-13|15:07:07.851] [1/6 OtterSync] downloading header-chain progress="99.90% 1.2GB/1.2GB" time-left=999hrs:99m total-time=58m40s download=0B/s flush=0B/s hash=0B/s complete=0B/s upload=0B/s peers=0 files=58 metadata=40/58 connections=0 alloc=6.1GB sys=8.6GB
[INFO] [08-13|15:07:17.479] [snapshots] no metadata yet files=18 list=v1-002017-002018-headers.seg,v1-002013-002014-bodies.seg,v1-002013-002014-headers.seg,v1-002010-002011-bodies.seg,v1-002014-002015-bodies.seg,...
[INFO] [08-13|15:07:27.850] [1/6 OtterSync] downloading header-chain progress="99.90% 1.2GB/1.2GB" time-left=999hrs:99m total-time=59m0s download=0B/s flush=0B/s hash=0B/s complete=0B/s upload=0B/s peers=0 files=58 metadata=40/58 connections=0 alloc=5.0GB sys=8.6GB
[INFO] [08-13|15:07:37.478] [snapshots] no metadata yet files=18 list=v1-002012-002013-bodies.seg,v1-002010-002011-headers.seg,v1-002015-002016-headers.seg,v1-002017-002018-headers.seg,v1-002013-002014-bodies.seg,...
[INFO] [08-13|15:07:47.851] [1/6 OtterSync] downloading header-chain progress="99.90% 1.2GB/1.2GB" time-left=999hrs:99m total-time=59m20s download=0B/s flush=0B/s hash=0B/s complete=0B/s upload=0B/s peers=0 files=58 metadata=40/58 connections=0 alloc=3.8GB sys=8.6GB
```

Issue created: https://github.com/erigontech/erigon/issues/11617

---

- Mainnet beacon node got stuck after some time. The problem was reported and full logs were sent. A sample:

```
[WARN] [08-14|19:18:05.248] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[INFO] [08-14|19:18:05.249] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9731006 distance-from-chain-tip=13m24s estimated-time-remaining=999h0m0s
[WARN] [08-14|19:18:08.636] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[INFO] [08-14|19:18:08.636] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9731006 distance-from-chain-tip=13m24s estimated-time-remaining=999h0m0s
[WARN] [08-14|19:18:33.898] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[INFO] [08-14|19:18:35.034] Subscribed to event stream topics topics="Set{block, voluntary_exit, chain_reorg, finalized_checkpoint, head, contribution_and_proof, attestation}"
[WARN] [08-14|19:18:39.409] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[INFO] [08-14|19:18:39.410] [Caplin] Forward Sync app=caplin stage=ForwardSync progress=9731006 distance-from-chain-tip=13m24s estimated-time-remaining=999h0m0s
[WARN] [08-14|19:18:40.416] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[WARN] [08-14|19:18:48.182] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[WARN] [08-14|19:18:48.871] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
[INFO] [08-14|19:18:53.736] P2P app=caplin peers=3
[WARN] [08-14|19:19:03.103] [Caplin] Failed to process block batch app=caplin stage=ForwardSync err="bad blocks segment received: replay block, status chain missing segment"
```

---

- Metric for block execution speed isn't being reported correctly:

```
curl -s 172.18.0.3:6060/debug/metrics/prometheus | grep chain_execution_seconds
# HELP chain_execution_seconds
# TYPE chain_execution_seconds summary
chain_execution_seconds{quantile="0.5"} NaN
chain_execution_seconds{quantile="0.9"} NaN
chain_execution_seconds{quantile="0.97"} NaN
chain_execution_seconds{quantile="0.99"} NaN
chain_execution_seconds_sum 0
chain_execution_seconds_count 0
```

---

- A memory issue when running many validators against the beacon node. We weren't able to run 10k validators on a 64GB machine. A spike in memory always resulted in Erigon being killed.

---

- While it was able to publish a slot (https://holesky.beaconcha.in/slot/2347059#overview), the sync aggregate is 0%.

---

- It failed to produce slot `2347170`:

```
[INFO] [08-19|11:54:11.738] [ForkChoiceUpdated] BlockBuilder added payload=3
[INFO] [08-19|11:54:11.738] Building block...
[WARN] [08-19|11:54:11.757] Failed to build a block err="[1/4 MiningCreateBlock] wrong head block: 253e8d7e554da36c0c30d382a71c4ba94e8af65fdf4f3b1d6f149c554e5e9e8e (current) vs d12f9ec06d039abb73e2603534901f05c2cb7533fa853441a24b6d3211e9b00b (requested)"
[EROR] [08-19|11:54:11.757] Failed to build PoS block err="[1/4 MiningCreateBlock] wrong head block: 253e8d7e554da36c0c30d382a71c4ba94e8af65fdf4f3b1d6f149c554e5e9e8e (current) vs d12f9ec06d039abb73e2603534901f05c2cb7533fa853441a24b6d3211e9b00b (requested)"
[EROR] [08-19|11:54:11.757] BlockProduction: Failed to get payload err="[1/4 MiningCreateBlock] wrong head block: 253e8d7e554da36c0c30d382a71c4ba94e8af65fdf4f3b1d6f149c554e5e9e8e (current) vs d12f9ec06d039abb73e2603534901f05c2cb7533fa853441a24b6d3211e9b00b (requested)"
[EROR] [08-19|11:54:13.738] Failed to produce beacon body err="failed to produce execution payload" slot=2347170
[WARN] [08-19|11:54:13.738] Failed to produce block err="failed to produce execution payload" slot=2347170
```

- And also slot `2347243`:

```
[EROR] [08-19|12:08:47.546] Failed to build PoS block err="[1/4 MiningCreateBlock] wrong head block: dd2350fab3a3b84199f35c443064c67c0240015ef396465c0712d5fc1c1e6cf9 (current) vs bf21f745e764e86f5b05be358e4b5c5702ba9eb1da2747ce4edcce67ecb96dab (requested)"
[EROR] [08-19|12:08:47.546] BlockProduction: Failed to get payload err="[1/4 MiningCreateBlock] wrong head block: dd2350fab3a3b84199f35c443064c67c0240015ef396465c0712d5fc1c1e6cf9 (current) vs bf21f745e764e86f5b05be358e4b5c5702ba9eb1da2747ce4edcce67ecb96dab (requested)"
[EROR] [08-19|12:08:47.556] Failed to produce beacon body err="failed to produce execution payload" slot=2347243
[WARN] [08-19|12:08:47.556] Failed to produce block err="failed to produce execution payload" slot=2347243
```
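
The kernel OOM entries quoted under the Mainnet memory item earlier in this report are easier to compare once `anon-rss` is converted from kB to GiB. The following is a minimal sketch, assuming the log lines have the same shape as the ones quoted here; the function name `oom_rss_gib` is hypothetical and not part of any Erigon tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of Erigon): read kernel OOM-killer lines on
# stdin and print "<pid> <anon-rss in GiB>" for each kill.
oom_rss_gib() {
  awk '
    /Out of memory: Killed process/ {
      # PID is the number following "Killed process " (15 characters).
      match($0, /Killed process [0-9]+/)
      pid = substr($0, RSTART + 15, RLENGTH - 15)
      # Resident anonymous memory is the number between "anon-rss:" and "kB".
      match($0, /anon-rss:[0-9]+kB/)
      rss_kb = substr($0, RSTART + 9, RLENGTH - 11)
      # 1 GiB = 1048576 kB
      printf "%s %.2f\n", pid, rss_kb / 1048576
    }
  '
}

# Sample input: the first and last entries copied from the OOM log in this report.
oom_rss_gib <<'EOF'
[Wed Jul 31 23:10:39 2024] Out of memory: Killed process 66075 (erigon) total-vm:19960624940kB, anon-rss:28957016kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:2311640kB oom_score_adj:0
[Mon Aug 12 03:07:43 2024] Out of memory: Killed process 205960 (erigon) total-vm:20005441256kB, anon-rss:31505872kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:258048kB oom_score_adj:0
EOF
```

For these two entries the resident memory works out to roughly 27.6 GiB and 30.0 GiB, i.e. close to the full 32GB of the original Mainnet machine. On a live host the same function could be fed from the kernel log instead of a pasted sample, for example via `dmesg -T | oom_rss_gib`.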