3 Commits

Author SHA1 Message Date
fpetkovski b408b522bc Improve the AlertmanagerMembersInconsistent alert
The expression alertmanager_cluster_members{job="alertmanager"}[5m] is assumed to return
one series for each alertmanager instance in the cluster. When running inside Kubernetes,
alertmanager pods can get evicted and rescheduled. This can change the instance label and
produce a new series for that alertmanager instance.

When the same pod gets evicted several times in a row, there will be a short interval in which
Prometheus will return values from both the new series and the old series.
As a result, counting the number of series for the alertmanager_cluster_members metric
will overestimate the number of instances in the given cluster.
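
The overcount described above can be illustrated with a minimal Python sketch. It is not the alert expression itself: the sample data, the instance addresses, and the helper `count_series` are hypothetical and only mimic counting distinct series inside a lookback window.

```python
# Hypothetical samples of alertmanager_cluster_members: (instance label, timestamp in seconds).
def count_series(samples, now, window):
    """Count distinct instance labels with at least one sample in (now - window, now]."""
    return len({instance for instance, ts in samples if now - window < ts <= now})

samples = [
    ("10.0.0.1:9093", 290),  # pod A, healthy
    ("10.0.0.2:9093", 290),  # pod B, healthy
    ("10.0.0.3:9093", 100),  # pod C before eviction (old series)
    ("10.0.0.7:9093", 290),  # pod C after rescheduling (new series, new instance label)
]

actual_instances = 3  # pods A, B and C
series_in_window = count_series(samples, now=300, window=300)
print(series_in_window, actual_instances)
```

With a 300-second window, the old and the new series of pod C are both counted, so the series count exceeds the real number of instances until the old series ages out of the window.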

This commit modifies the AlertmanagerMembersInconsistent alert to increase the for clause to 15m
in order to reduce the probability of a false positive.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
2021-06-22 08:21:02 +02:00
Björn Rabenstein ce108378d4
Fix and improve AlertmanagerClusterFailedToSendAlerts (#2437)
The alert was just looking at the minimum across integrations. So a
complete failure of one integration would be masked by another
integration that is still working. With this fix, the `integration` label is retained
(as it was already expected by the `description`), and thus any
failing integration will trigger the alert.
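
The masking effect can be sketched in a few lines of Python. This is only an illustration under assumed values: the integration names, the failure ratios, and the `THRESHOLD` constant are hypothetical and are not taken from the mixin.

```python
# Hypothetical failure ratios per integration (failed notifications / total notifications).
failure_ratio = {"slack": 1.0, "email": 0.0}
THRESHOLD = 0.01  # assumed alert threshold, for illustration only

# Before the fix: one minimum across all integrations.
fires_before = min(failure_ratio.values()) > THRESHOLD

# After the fix: the integration label is retained, so each integration is checked on its own.
firing_after = sorted(name for name, ratio in failure_ratio.items() if ratio > THRESHOLD)

print(fires_before, firing_after)
```

Here the completely failing `slack` integration is hidden by the healthy `email` integration when only the minimum is considered, while the per-integration check reports it.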

In addition, an `alertmanagerCriticalIntegrationsRegEx` is provided
that allows marking integrations as critical. Integrations that are
not used to deliver critical alerts, or those that are just there for
auditing and logging purposes can now be configured to only trigger a
warning alert if they fail.
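
How such a regular expression could split integrations into critical and warning severities is sketched below. The regex value, the integration names, and the `severity` helper are assumptions made for illustration; the full-match semantics mirror how Prometheus anchors regex label matchers.

```python
import re

# Hypothetical value for alertmanagerCriticalIntegrationsRegEx.
CRITICAL_INTEGRATIONS_REGEX = "pagerduty|opsgenie"

def severity(integration):
    """Return the alert severity for a failing integration (illustrative helper)."""
    # fullmatch is used because Prometheus regex matchers are anchored at both ends.
    if re.fullmatch(CRITICAL_INTEGRATIONS_REGEX, integration):
        return "critical"
    return "warning"

for name in ("pagerduty", "webhook"):
    print(name, severity(name))
```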

Signed-off-by: beorn7 <beorn@grafana.com>
2020-12-23 15:15:38 +01:00
Tom Wilkie 6c5dee008f
Beginnings of an Alertmanager mixin. (#1629)
Add an Alertmanager mixin

Signed-off-by: beorn7 <beorn@grafana.com>
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>
Co-authored-by: beorn7 <beorn@grafana.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
2020-12-03 15:57:42 +01:00