Following PR https://github.com/ceph/ceph/pull/55495, which fixed the
dashboard for multiple clusters storing their metrics in a single
Prometheus instance, this PR addresses the same issues for alerts.
Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
Currently the alert generator breaks when you run
`tox -ealerts-fix`. I fixed it and re-ran the command, which also built
a new JSON file.
Signed-off-by: Nizamudeen A <nia@redhat.com>
The 'prometheus_alerts.yml' file should not be edited directly.
Changes should be made in the 'prometheus_alerts.libsonnet' file
(and/or any other appropriate lib/jsonnet files) and the YAML
generated using the 'make generate' command.
This commit adds all the changes to 'prometheus_alerts.libsonnet' and
regenerates the prometheus_alerts YAML file.
PS: all the changes seen in 'prometheus_alerts.yml' are due to the
re-arrangement of lines. The content of the file remains the same.
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
Until now, daemon health metrics were stored without being used. One of
the most helpful of these is SLOW_OPS for OSDs and MONs. This commit
exposes it to provide fine-grained metrics for finding troublesome OSDs,
instead of relying on a single health check for slow ops across the
whole cluster.
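As a rough illustration of what per-daemon SLOW_OPS values make possible, the
sketch below ranks daemons by their slow-op count. The function name, the
input mapping and the sample values are hypothetical and are not taken from
this commit.

```python
from typing import Dict, List, Tuple


def daemons_with_slow_ops(
    slow_ops: Dict[str, int], threshold: int = 0
) -> List[Tuple[str, int]]:
    """Return (daemon, count) pairs above the threshold, worst first.

    `slow_ops` is a hypothetical mapping of daemon name to its current
    SLOW_OPS value, e.g. as read from a per-daemon health metric.
    """
    hits = [(daemon, count) for daemon, count in slow_ops.items() if count > threshold]
    # Sort by count, highest first, so the most troublesome daemon leads.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Illustrative sample values only.
samples = {"osd.0": 0, "osd.3": 12, "mon.a": 2}
print(daemons_with_slow_ops(samples))
```

A cluster-wide health check would only report that slow ops exist; the
per-daemon view points at osd.3 first.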
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Currently there is no alert for a misconfigured or failed network interface
card that is part of a network bond.
This could lead to redundancy and performance being degraded unnoticed.
To solve this, I use node exporter metrics to compare the total number of peers
in the bond with the number that are active. If the numbers differ, something
is wrong and should be looked at.
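The comparison described above can be sketched as follows. This is only an
illustration of the logic, not the alert rule itself: the function name, the
two input mappings (bond name to peer count) and the sample values are
assumptions made for the example.

```python
from typing import Dict, List


def degraded_bonds(
    total_peers: Dict[str, int], active_peers: Dict[str, int]
) -> List[str]:
    """Return the bonds whose active peer count differs from the total.

    Both arguments are hypothetical mappings of bond name to a peer count,
    standing in for the two node exporter metrics being compared.
    """
    return sorted(
        bond
        for bond, total in total_peers.items()
        # A bond with no active-peer sample is treated as having zero active peers.
        if active_peers.get(bond, 0) != total
    )


# Illustrative sample values only: bond1 has lost one of its two peers.
total = {"bond0": 2, "bond1": 2}
active = {"bond0": 2, "bond1": 1}
print(degraded_bonds(total, active))
```

Any bond returned here is the case the alert is meant to surface: the bond
still works, but with less redundancy than configured.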
Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>