From b408b522bc653d014e53035e59fa394cc1edd762 Mon Sep 17 00:00:00 2001
From: fpetkovski
Date: Thu, 10 Jun 2021 13:26:36 +0200
Subject: [PATCH] Improve the AlertmanagerMembersInconsistent alert

The expression
max_over_time(alertmanager_cluster_members{job="alertmanager"}[5m])
is assumed to return one series for each alertmanager instance in the
cluster.

When running inside Kubernetes, alertmanager pods can get evicted and
rescheduled. This can change the instance label and produce a new series
for that alertmanager instance. When the same pod gets evicted several
times in a row, there will be a short interval in which Prometheus
returns values from both the new series and the old series, because the
5m range still covers the last samples of the old series. As a result,
counting the number of series for the alertmanager_cluster_members
metric will overestimate the number of instances in the given cluster.

This commit modifies the AlertmanagerMembersInconsistent alert to
increase the 'for' clause from 10m to 15m in order to reduce the
probability of a false positive.

Signed-off-by: fpetkovski
---
 doc/alertmanager-mixin/alerts.libsonnet | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/alertmanager-mixin/alerts.libsonnet b/doc/alertmanager-mixin/alerts.libsonnet
index a60428a1..720e411a 100644
--- a/doc/alertmanager-mixin/alerts.libsonnet
+++ b/doc/alertmanager-mixin/alerts.libsonnet
@@ -29,7 +29,7 @@
               < on (%(alertmanagerClusterLabels)s) group_left
                 count by (%(alertmanagerClusterLabels)s) (max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m]))
             ||| % $._config,
-            'for': '10m',
+            'for': '15m',
             labels: {
               severity: 'critical',
             },