## Ethereum Prometheus alerts

Nearly all of our Ethereum network alerts are based on a % of the nodes we run for a network hitting a condition. We generally don't alert on individual instances having issues through our Ethereum channels. Alerting on a % of nodes seeing a condition generally gives a better signal-to-noise ratio (i.e. an alert usually means there is an actual issue), but some of these can still be pretty annoying, so they might need individual tweaking.

### Finalized Epoch Stalled on % of nodes we run on a network

#### Description

Only alerts if more than 35% of our nodes on a network have had a stalled finalized epoch for more than 20 minutes.

#### Alert config:

```yaml
evaluate_every: 1m
for: 2m
expression: $A > 0.35
```

#### Query:

```prometheus
count by (network)(
  count by (instance, network)(
    changes(beacon_finalized_epoch{network!=""}[20m]) == 0
  )
)
/
count by (network)(
  count by (instance, network)(
    beacon_finalized_epoch{network!=""}
  )
)
```

### Head slot not advancing on % of nodes we run on a network

#### Description

Only alerts if more than 35% of our nodes on a network have had a stalled head slot for more than 15 minutes. Same as above, trying to reduce noise.

#### Alert config:

```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.35
```

#### Query:

```prometheus
count by (network)(
  count by (network, instance)(
    changes(beacon_head_slot{network!=""}[15m]) == 0
  )
)
/
count by (network)(
  count by (network, instance)(
    beacon_head_slot{network!=""}
  )
)
```

### Justified -> Finalized distance greater than 1 on % of nodes we run on a network

#### Description

Same as above, but fires when the justified -> finalized epoch distance is greater than 1 on more than 35% of our nodes on a network.

#### Alert config:

```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.35
```

#### Query:

```prometheus
count by (network)(
  count by (network, instance)(
    (
      beacon_current_justified_epoch{network!=""}
      -
      beacon_finalized_epoch{}
    ) > 1
  )
)
/
count by (network)(
  count by (instance, network)(
    beacon_current_justified_epoch{network!=""}
  )
)
```

### More than 2 reorgs in 5 mins on % of nodes we run on a network

#### Description

Same as above, needs more than 30% of our nodes on a network to have seen more than 2 reorgs in the last 5 minutes to fire.

#### Alert config:

```yaml
evaluate_every: 1m
for: 0m
expression: $A > 0.30
```

#### Query:

```prometheus
count by (network)(
  count by (instance, network)(
    increase(beacon_reorgs_total{network!=""}[5m]) > 2
  )
)
/
count by (network)(
  count by (instance, network)(
    # beacon_head_slot here is only used to count the total number of
    # nodes that we are running on the network.
    beacon_head_slot{network!=""}
  )
)
```

### Trailing distance greater than 50 slots for % of nodes we run on a network

#### Description

Same as above, needs more than 30% of our nodes on a network to be trailing more than 50 slots behind the wall clock slot to fire.

Note: have barely seen this one fire, so it might need double checking (see the sketch after the query below).

#### Alert config:

```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.30
```

#### Query:

```prometheus
count by (network)(
  (
    beacon_slot{network!=""}
    -
    beacon_head_slot{}
  ) > 50
)
/
count by (network)(
  (
    beacon_slot{network!=""}
    -
    beacon_head_slot{}
  )
)
```
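A quick way to double-check this alert by hand (a sketch, assuming you're running it as an instant query from the Prometheus console or Grafana Explore) is to look at the per-instance distance rather than the per-network ratio:

```prometheus
# Per-instance distance between the wall clock slot and the head slot,
# largest first. Instances over 50 here are the ones that end up in the
# numerator of the alert query above.
sort_desc(
  beacon_slot{network!=""} - beacon_head_slot{}
)
```

Note that `sort_desc` only orders the result of an instant query; it has no effect on graphed range queries.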
### Less than 5 peers for 5 minutes for % of nodes we run on a network

#### Description

Same as above, needs more than 30% of our nodes on a network to have averaged fewer than 5 peers over the last 5 minutes.

#### Alert config:

```yaml
evaluate_every: 1m
# Large `for` value here to try and account for fresh networks. Should probably
# be much lower (1m?) for setups that aren't experiencing a network
# startup/genesis frequently.
for: 20m
expression: $A > 0.30
```

#### Query:

```prometheus
count by (network)(
  sum by (instance, network)(
    avg_over_time(libp2p_peers{network!=""}[5m])
  ) < 5
)
/
count by (network)(
  count by (instance, network)(
    libp2p_peers{network!=""}
  )
)
```
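All of these alerts share the same denominator idea: "how many nodes do we run on this network?", computed from any metric that every node exports (`beacon_head_slot`, `libp2p_peers`, etc. above). As a sketch, the pattern looks like this on its own:

```prometheus
# Total nodes per network: the inner count gives one series per instance,
# the outer count collapses those into a single node count per network.
count by (network)(
  count by (instance, network)(
    beacon_head_slot{network!=""}
  )
)
```

This could also be pulled out into a recording rule (e.g. a hypothetical `ethereum:nodes:count`) so each alert divides by the same series instead of repeating the double `count`.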