## Ethereum Prometheus alerts
Nearly all of our Ethereum network alerts are based on a % of the nodes we run for a network hitting a condition. We generally don't alert on individual instances having issues through our Ethereum channels.
Alerting on a % of nodes seeing a condition generally gives a better signal-to-noise ratio (the alert only fires when there is an actual network-wide issue rather than a single misbehaving node), but some of these can still be pretty noisy and might need individual tweaking.
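All of the queries below follow the same shape: a numerator counting the nodes on a network that currently meet the alert condition, divided by a denominator counting all of the nodes we run on that network. The result is a ratio between 0 and 1 per network, which is what `$A` refers to in each alert config's `expression`. A minimal sketch of that pattern (the `some_*` names are placeholders, not real metrics):
```prometheus
# Fraction of our nodes on each network that currently meet the alert condition (0.0 - 1.0)
count by (network)(
  count by (instance, network)(
    some_condition_expression        # placeholder for the per-node condition, e.g. changes(metric[20m]) == 0
  )
)
/
count by (network)(
  count by (instance, network)(
    some_metric{network!=""}         # placeholder metric used only to count nodes per network
  )
)
```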
### Finalized Epoch Stalled on % of nodes we run on a network
#### Description
Only alerts if more than 35% of our nodes on a network have had a stalled finalized epoch for more than 20mins.
#### Alert config:
```yaml
evaluate_every: 1m
for: 2m
expression: $A > 0.35
```
#### Query:
```prometheus
count by (network)(
  count by (instance, network)(
    changes(beacon_finalized_epoch{network!=""}[20m]) == 0
  )
)
/
count by (network)(
  count by (instance, network)(
    beacon_finalized_epoch{network!=""}
  )
)
```
### Head slot not advancing on % of nodes we run on a network
#### Description
Only alerts if more than 35% of our nodes on a network have had a stalled head slot for more than 20mins (a 15m lookback window plus the 5m `for`). Same as above, trying to reduce noise.
#### Alert config:
```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.35
```
#### Query:
```prometheus
count by (network)(
  count by (network, instance)(
    changes(beacon_head_slot{network!=""}[15m]) == 0
  )
)
/
count by (network)(
  count by (network, instance)(
    beacon_head_slot{network!=""}
  )
)
```
### Justified -> Finalized distance greater than 1 on % of nodes we run on a network
#### Description
Same as above, but alerts when the distance from the current justified epoch to the finalized epoch is greater than 1 on more than 35% of our nodes on a network.
#### Alert config:
```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.35
```
#### Query:
```prometheus
count by (network)(
  count by (network, instance)(
    (
      beacon_current_justified_epoch{network!=""}
      -
      beacon_finalized_epoch{}
    ) > 1
  )
)
/
count by (network)(
  count by (instance, network)(
    beacon_current_justified_epoch{network!=""}
  )
)
```
### More than 2 reorgs in 5 mins on % of nodes we run on a network
#### Description
Same as above, needs more than 30% of our nodes on a network to have seen more than 2 reorgs in the last 5 minutes to fire.
#### Alert config:
```yaml
evaluate_every: 1m
for: 0m
expression: $A > 0.30
```
#### Query:
```prometheus
count by (network)(
  count by (instance, network)(
    increase(beacon_reorgs_total{network!=""}[5m]) > 2
  )
)
/
count by (network)(
  count by (instance, network)(
    beacon_head_slot{network!=""} # beacon_head_slot is only used here to count the total number of nodes we run on the network
  )
)
```
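The denominator relies on every node exporting `beacon_head_slot`. If your scrape config attaches the same `network`/`instance` labels to the standard `up` series (an assumption, not necessarily true of this setup), a slightly more direct node count could look like this sketch:
```prometheus
# Alternative denominator sketch: total nodes per network, assuming the `up`
# series carries the same network/instance labels as the beacon metrics.
count by (network)(
  count by (instance, network)(
    up{network!=""} == 1
  )
)
```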
### Trailing distance greater than 50 slots for % of nodes we run on a network
#### Description
Same as above, needs more than 30% of our nodes on a network to be trailing more than 50 slots behind the wall clock slot to fire.
Note: Have barely seen this one fire, might need double checking.
#### Alert config:
```yaml
evaluate_every: 1m
for: 5m
expression: $A > 0.30
```
#### Query:
```prometheus
count by (network)(
  (
    beacon_slot{network!=""}
    -
    beacon_head_slot{}
  ) > 50
)
/
count by (network)(
  (
    beacon_slot{network!=""}
    -
    beacon_head_slot{}
  )
)
```
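Per the note above, this alert has rarely fired, so it is worth sanity-checking by hand. A quick ad-hoc query (run in the explore/graph view) to eyeball how far nodes are actually trailing before trusting the 50-slot threshold:
```prometheus
# The ten nodes currently trailing the wall clock slot by the most
topk(10,
  beacon_slot{network!=""} - beacon_head_slot{}
)
```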
### Less than 5 peers for 5 minutes for % of nodes we run on a network
#### Description
Same as above, needs more than 30% of our nodes on a network to have averaged fewer than 5 peers over the last 5 minutes.
#### Alert config:
```yaml
evaluate_every: 1m
for: 20m # large `for` value here to try and account for fresh networks. Should probably be much lower (1m?) for setups that aren't experiencing a network startup/genesis frequently
expression: $A > 0.30
```
#### Query:
```prometheus
count by (network)(
  sum by (instance, network)(
    avg_over_time(libp2p_peers{network!=""}[5m])
  ) < 5
)
/
count by (network)(
  count by (instance, network)(
    libp2p_peers{network!=""}
  )
)
```
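When tuning the `for` value or the 5-peer threshold (especially around fresh networks/genesis), it can help to look directly at the nodes with the lowest smoothed peer counts. A quick ad-hoc sketch:
```prometheus
# The ten nodes with the lowest 5m average peer count
bottomk(10,
  avg_over_time(libp2p_peers{network!=""}[5m])
)
```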