Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Prometheus reports an error - many-to-many matching not allowed: matching labels must be unique on one side for CephPoolGrowthWarning if we have same pool ids on two different instances.
Fixes: https://tracker.ceph.com/issues/58017
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
Currently there is no alert for a network interface card to be misconfigured or
failed which is part of a network bond.
This could lead to redundancies and performance being degraded unnoticed.
To solve this, I use node exporter metrics to look at the number of total peers
of the bond and the ones that are active. If the numbers differ, something is up
and should be looked at.
Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>
After https://github.com/ceph/ceph/pull/44059 the monitoring/prometheus
and monitoring/grafana/dashboards directories are changed to
monitoring/ceph-mixins. That broke the shared_folders in the cephadm
bootstrap script.
Changed all the instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixins
Also, renaming all the instances of prometheus_alerts.yaml to
prometheus_alerts.yml.
Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
Mixin is a way to bundle dashboards, prometheus rules and alerts into
jsonnet package. Shifting to mixin will allow easier integration with
monitoring automation that some users may use.
This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts
was also converted to Jsonnet using an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should not change neithers the dashboards nor the Prometheus alerts.
In the future some configuration will also be added to jsonnet to add
more functionalities to the dashboards or alerts (i.e.: multi cluster).
Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>