Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Do only use `instance` to query for hostnames in single-cluster-mode.
Consider the cluster matcher only in multi-cluster-mode. In this case
the query will look like:
`"label_values({cluster=~\"$cluster\"}, instance)"`.
Fixes: https://tracker.ceph.com/issues/57987
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
Add the missing information about the RGW instance to the labels of the
"Average GET/PUT Latencies" panel on the "RGW Overview" dashboard.
Fixes: https://tracker.ceph.com/issues/57166
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
Correct a wrongly set bracket on ceph-dashboard -> OSD Overview ->
OSD Objectstore Types resulting in a parser error.
Fixes: https://tracker.ceph.com/issues/56948
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
Before this refactor we couln't override the config externally. Now the
_config is correctly propagated and not only taken from the
config.libsonnet file.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Fix most of the issues reported by dashboards-linter:
- Add matcher/template for job (and also cluster)
- use $__rate_interval everywhere
Also this change all the irate functions to rate as most of irate where
not actually used correctly. While using irate on graph for instance you
can easily miss some of the metrics values as irate only take the two
last values and the query steps can be quite large if you want a graph
for a few hours/a day or more.
Fixes: https://tracker.ceph.com/issues/55003
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
ceph-mixin: add config with matchers and tags
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Mixin is a way to bundle dashboards, prometheus rules and alerts into
jsonnet package. Shifting to mixin will allow easier integration with
monitoring automation that some users may use.
This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts
was also converted to Jsonnet using an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should not change neithers the dashboards nor the Prometheus alerts.
In the future some configuration will also be added to jsonnet to add
more functionalities to the dashboards or alerts (i.e.: multi cluster).
Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>