Improve the AlertmanagerMembersInconsistent alert

The expression max_over_time(alertmanager_cluster_members{job="alertmanager"}[5m])
is assumed to return one series for each alertmanager instance in the cluster.
When running inside Kubernetes,
alertmanager pods can get evicted and rescheduled. This can change the instance label and
produce a new series for that alertmanager instance.

When the same pod gets evicted several times in a row, there is a short interval in which
Prometheus returns values from both the new series and the old series, because the old
series still has samples inside the 5m range window.
As a result, counting the number of series for the alertmanager_cluster_members metric
overestimates the number of instances in the given cluster.
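
The overcount can be illustrated with a rough simulation. This is only a sketch, not how
Prometheus evaluates the rule: the instance addresses, the 30s scrape interval, the
eviction time, and the helper name count_series_in_window are all made up for
illustration, and the window is treated as left-open for simplicity.

```python
from typing import Dict, List, Tuple

WINDOW_SECONDS = 5 * 60  # mirrors the [5m] range in the alert expression
SCRAPE_INTERVAL = 30     # assumed scrape interval, not taken from the source

Samples = Dict[str, List[Tuple[int, float]]]


def scrape(start: int, end: int, value: float = 3.0) -> List[Tuple[int, float]]:
    """Generate (timestamp, value) samples from start to end inclusive."""
    return [(ts, value) for ts in range(start, end + 1, SCRAPE_INTERVAL)]


def count_series_in_window(samples: Samples, now: int,
                           window: int = WINDOW_SECONDS) -> int:
    """Count series with at least one sample in (now - window, now].

    This approximates count(max_over_time(metric[5m])): a series contributes
    to the count as long as any of its samples falls inside the range window.
    """
    return sum(
        1
        for points in samples.values()
        if any(now - window < ts <= now for ts, _ in points)
    )


# Hypothetical three-replica cluster. The third pod is evicted after its
# scrape at t=570 and comes back at t=630 under a new instance label.
samples: Samples = {
    "10.0.0.1:9093": scrape(0, 900),
    "10.0.0.2:9093": scrape(0, 900),
    "10.0.0.3:9093": scrape(0, 570),    # old series, stops after eviction
    "10.0.0.7:9093": scrape(630, 900),  # new series for the same pod
}

before = count_series_in_window(samples, now=500)  # only three series exist
during = count_series_in_window(samples, now=700)  # old and new both in window
after = count_series_in_window(samples, now=900)   # old series has aged out
```

In this sketch the count rises to four while the old series is still inside the window and
drops back to three once it ages out. Each further eviction starts another such interval,
which is why a longer for clause lowers the chance that the alert fires on this artifact.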

This commit modifies the AlertmanagerMembersInconsistent alert, increasing its 'for' clause
from 10m to 15m in order to reduce the probability of a false positive.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
fpetkovski 2021-06-10 13:26:36 +02:00
parent 58169c1412
commit b408b522bc
1 changed file with 1 addition and 1 deletion

@@ -29,7 +29,7 @@
< on (%(alertmanagerClusterLabels)s) group_left
count by (%(alertmanagerClusterLabels)s) (max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m]))
||| % $._config,
-'for': '10m',
+'for': '15m',
labels: {
severity: 'critical',
},