SLOW_OPS is triggered by the op tracker and generates a health
alert, but health checks do not create metrics for Prometheus to
use as alert triggers. This change adds a SLOW_OPS metric and
provides a simple means to extend to other relevant health
checks in the future.
If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition, the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.
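The extract-or-drop behaviour described above could look roughly like the
sketch below. It is not the module's actual code: the summary format, the
regular expression, the metric name and both helper functions are assumptions
made for illustration.

```python
import logging
import re
from typing import Dict, Optional

log = logging.getLogger(__name__)

# Assumed summary format, e.g. "3 slow ops, oldest one blocked for 42 sec".
SLOW_OPS_RE = re.compile(r"^(\d+) slow ops")

# Assumed metric name, used only for this illustration.
METRIC = "ceph_healthcheck_slow_ops"


def extract_slow_ops(summary: str) -> Optional[int]:
    """Return the leading slow-op count, or None if the summary does not parse."""
    match = SLOW_OPS_RE.match(summary)
    return int(match.group(1)) if match else None


def update_slow_ops(metrics: Dict[str, int], summary: str) -> None:
    """Set the metric on success; log and drop it when extraction fails."""
    value = extract_slow_ops(summary)
    if value is None:
        log.error("unable to extract slow ops count from: %s", summary)
        metrics.pop(METRIC, None)
    else:
        metrics[METRIC] = value
```

Dropping the metric on a parse failure, rather than leaving the last value in
place, keeps a stale count from being scraped.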
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The labels on one side of the query do not match the labels on the other
side, where a label_replace is used. The fix applies the same label_replace
to the side that was missing it.
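To see why a label_replace on only one side breaks the match, here is a rough
Python stand-in for the PromQL behaviour. The label names (`ceph_daemon`,
`osd`), the regex and the `join_on` helper are hypothetical; the commit does
not say which labels were involved.

```python
import re


def label_replace(series, dst, replacement, src, regex):
    """Approximate label_replace: set `dst` from `src` when `regex` fully matches."""
    template = re.sub(r"\$(\d+)", r"\\\1", replacement)  # "$1" -> "\1" for Python
    out = []
    for labels in series:
        labels = dict(labels)
        match = re.fullmatch(regex, labels.get(src, ""))
        if match:
            labels[dst] = match.expand(template)
        out.append(labels)
    return out


def join_on(left, right, label):
    """Pair series whose `label` values are equal; series lacking the label are skipped."""
    by_key = {r[label]: r for r in right if label in r}
    return [(l, by_key[l[label]]) for l in left if l.get(label) in by_key]


# Hypothetical rewrite: derive an `osd` label from `ceph_daemon`.
rewrite = dict(dst="osd", replacement="$1", src="ceph_daemon", regex=r"osd\.(.*)")

left = label_replace([{"ceph_daemon": "osd.0"}], **rewrite)
right_raw = [{"ceph_daemon": "osd.0"}]

broken = join_on(left, right_raw, "osd")  # right side has no `osd` label
fixed = join_on(left, label_replace(right_raw, **rewrite), "osd")
```

With the rewrite on the left side only, `broken` is empty; once both sides
carry the derived label, `fixed` holds the expected pair.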
Fixes: https://tracker.ceph.com/issues/47334
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
This commit provides the Makefile to create the
ceph-grafana containers for nautilus, octopus and
master releases.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
- Exchange read/write legends in the `I/O Bytes per second` panel.
- Rename `I/O Bytes per second` to `Throughput`.
- Rename `IOPS Count` to just `IOPS`.
- Remove instance name from legends.
- Fix typo: `Averange` -> `Average`.
Fixes: https://tracker.ceph.com/issues/45735
Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>
* RGW sync perf. counters are now exposed through grafana panels.
* Sync Performance tab is only shown if rgw realm is detected.
* Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility).
Fixes: https://tracker.ceph.com/issues/45310
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
The alert was triggered when less than 90% of OSDs were _up_, but then the
description took that value and described it as the percentage of OSDs being
_down_. So with 12% of OSDs down, the alert description would read:
```
88% or 88 of 100 OSDs are down (>=10%).
```
which can be panic-inducing.
This commit changes the alert expression to actually compute the ratio of OSDs
being down, which makes the correct value appear in the description.
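The arithmetic behind the confusing description can be checked with a small
sketch. The two helper functions are illustrative and are not the alert
expressions themselves; the numbers follow the 12%-down example above.

```python
def percent_up(up: int, total: int) -> float:
    """Percentage of OSDs that are up (what the old expression computed)."""
    return 100.0 * up / total


def percent_down(up: int, total: int) -> float:
    """Percentage of OSDs that are down (what the description talks about)."""
    return 100.0 * (total - up) / total


total, up = 100, 88

old_value = percent_up(up, total)    # value the old description reported as "down"
new_value = percent_down(up, total)  # value the description actually means

old_fires = old_value < 90
new_fires = new_value >= 10
```

Both conditions fire for the same cluster state; only the value handed to the
description differs (88 versus 12).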
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Set decimal precision to 2 positions for charts using percentunits.
Fixes: https://tracker.ceph.com/issues/45183
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
The $ceph_hosts variable contained the FQDN for hosts
while the instance label created by ceph only has
the hostname.
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
In the AVG Disk Utilization panel, the result is calculated
by combining the output of node_disk_io_time_seconds_total
with the output of ceph_disk_occupation. However, the
first vector encodes the instance label with the full FQDN
while the ceph label only contains the hostname:port. In
order for these to match correctly, the domain name and port
have to be stripped from the labels.
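As a sketch of the normalization, the function below reduces both label forms
to the bare hostname. The function name, the regex and the sample hostnames
are assumptions; the dashboards do this inside the PromQL query rather than
in Python.

```python
import re


def short_host(instance: str) -> str:
    """Keep only the text before the first '.' or ':' (drops domain and port).

    Limitation of this sketch: an IP-address instance such as
    "10.0.0.1:9100" would be cut down to "10".
    """
    return re.sub(r"[.:].*$", "", instance)


fqdn_label = short_host("ceph-node1.example.com")
port_label = short_host("ceph-node1:9283")
```

Both inputs come out as `ceph-node1`, so the two vectors can be matched on
the normalized value.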
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
When moving to LVM-based ceph-volume setups, several
grafana dashboards stopped working. The problem is that
(device, instance) no longer results in unique labels
which causes errors like:
"many-to-many matching not allowed: matching labels must be unique on one side"
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
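The uniqueness problem behind the many-to-many error in LVM-based setups can
be illustrated with a small check. The series below are hypothetical (two
OSDs sharing one device on one host), as is the `duplicate_keys` helper.

```python
from collections import Counter


def duplicate_keys(series, on):
    """Return the `on` label tuples that occur on more than one series."""
    counts = Counter(tuple(s.get(label, "") for label in on) for s in series)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical series: two OSDs backed by the same device on one host.
occupation = [
    {"ceph_daemon": "osd.0", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.1", "device": "sdb", "instance": "node1"},
]

clashes = duplicate_keys(occupation, on=("device", "instance"))
unique = duplicate_keys(occupation, on=("ceph_daemon", "device", "instance"))
```

Matching on `(device, instance)` alone leaves one clashing key, while adding
a distinguishing label to the match makes every series unique again.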
The references to `$osd_hosts` etc. were encoded as
`[[osd_hosts]]` in the PromQL expression divisor, and
the panel always displayed N/A as the result of the
query.
Replacing the `[[...]]` with `$...` makes the expression
work again.
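For a sense of the rewrite, the sketch below converts the `[[...]]` variable
form to `$...` with a regular expression. The function name and the sample
expression are assumptions; only the `[[osd_hosts]]` to `$osd_hosts` change
comes from this commit.

```python
import re


def use_dollar_vars(expr: str) -> str:
    """Rewrite [[name]] template variables as $name."""
    return re.sub(r"\[\[(\w+)\]\]", r"$\1", expr)


# Hypothetical expression fragment for illustration.
before = 'count(ceph_osd_metadata{hostname=~"[[osd_hosts]]"})'
after = use_dollar_vars(before)
```

Range selectors such as `[5m]` use single brackets, so the pattern leaves
them untouched.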
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
Fixes: https://tracker.ceph.com/issues/44623
Signed-off-by: Enno Gotthold <egotthold@suse.de>
By default this dashboard will be empty, like the existing
dashboard with the summary for all RBD images.
Prefer the non-British spelling of utilization since that's what the majority
of the code base seems to use.
Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>
mgr/dashboard: grafana charts match time picker selection.
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Reviewed-by: Patrick Seidensal <pnawracay@suse.com>
Remove shortcut menu (links) and add check in grafana CI script.
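A check of this kind could be sketched as follows. This is not the CI script
itself: the helper name and the sample dashboards are hypothetical, and the
sketch only assumes that a dashboard's JSON keeps its shortcut menu in a
top-level `links` list.

```python
import json


def dashboards_with_links(dashboards):
    """Return names of dashboards whose JSON has a non-empty top-level `links`."""
    return sorted(
        name for name, text in dashboards.items() if json.loads(text).get("links")
    )


# Hypothetical dashboard files, keyed by file name.
samples = {
    "clean.json": json.dumps({"title": "Clean", "links": []}),
    "menu.json": json.dumps({"title": "Menu", "links": [{"title": "Shortcut"}]}),
}

offenders = dashboards_with_links(samples)
```

A CI job could fail whenever the returned list is non-empty.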
Fixes: https://tracker.ceph.com/issues/43091
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
Fixes: https://tracker.ceph.com/issues/42542
Sort order was wrong for some dashboards.
- Fixed empty / buggy Top 3 clients IOPS by pool / Throughput in Pools
  Overall performance.
- Fixed Avg utilization "Multiple series found" in Host Overall
  performance.
- Fixed invalid dimensions for plot in OSD Overall performance.
Signed-off-by: Radu Toader <radu.m.toader@gmail.com>
Fix various panels that used outdated metric names or clunky or
unnecessary label_replace calls. Also unify the style of many panels.
Fixes: http://tracker.ceph.com/issues/39652
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
We are currently hosting the grafana dashboards in our repo but we do
not install them. This patch adds the cmake support.
Signed-off-by: Boris Ranto <branto@redhat.com>