Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool if available to provide rule checks and unittest runs.
In addition to these checks, a validate_rules script provides
format and content checks against all rules. It is also
called via tox (but can be run independently too).
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This patch creates a health history object maintained in
the module's kvstore. The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear
In addition to the new commands, the additional metrics
have been used to update the prometheus alerts
Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This PR intends to refactor cephfs dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor osds dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor pools dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor rbd dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor radosgw dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor hosts dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
The ceph MIB has been created and maintained in a
separate repo:
https://github.com/SUSE/prometheus-webhook-snmp
This patch brings this MIB into the main ceph repo, so
alert changes can target prometheus and potentially
SNMP environments within the same PR.
Kudos to Volker Theile for creating the MIB.
Fixes: https://tracker.ceph.com/issues/52708
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.
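The intent of the fix can be illustrated with a minimal Python sketch. This is not the actual alert rule (which lives in the Prometheus alerts); the `interfaces` structure and the "most common MTU" comparison are assumptions made for illustration only.

```python
from collections import Counter


def mtu_mismatches(interfaces):
    """Return names of active NICs whose MTU differs from the most common one."""
    # Interfaces that are down are skipped entirely, as the fix describes.
    active = [nic for nic in interfaces if nic["up"]]
    if not active:
        return []
    # Assumption: the "expected" MTU is the most common one among active NICs.
    common_mtu, _ = Counter(nic["mtu"] for nic in active).most_common(1)[0]
    return [nic["name"] for nic in active if nic["mtu"] != common_mtu]


# Hypothetical host: eth2 is down, so its MTU is ignored.
nics = [
    {"name": "eth0", "mtu": 1500, "up": True},
    {"name": "eth1", "mtu": 1500, "up": True},
    {"name": "eth2", "mtu": 9000, "up": False},
    {"name": "eth3", "mtu": 9000, "up": True},
]
print(mtu_mismatches(nics))
```

Only eth3 is reported here; eth2 has the same MTU but is down and therefore does not trigger the warning.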
Fixes: https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
so we don't build this target when running "make", and hence avoid
accessing the internet in a build environment where internet
access is not allowed.
Signed-off-by: Kefu Chai <kchai@redhat.com>
when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during the configuration and building phases,
all of which is irrelevant to us. so let's just silence it.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The value we get is a per-unit value (a fraction of 1), so the limits and
the max value should be expressed out of 1, not 100. Note that the value
being shown was correct; it was the gauge that was not showing the correct
indicators.
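As a small worked example of the rescaling, assuming hypothetical gauge thresholds of 70 and 80 with a maximum of 100 (the real values are in the dashboard JSON and are not quoted here):

```python
def percent_to_unit(values):
    """Convert percent-scale limits (0..100) to per-unit limits (0..1)."""
    return [v / 100 for v in values]


# Hypothetical thresholds and max value, originally written for a 0..100 scale.
print(percent_to_unit([70, 80, 100]))
```

With limits on the same 0..1 scale as the metric, the gauge colours and maximum line up with the value being displayed.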
Signed-off-by: David Caro <david@dcaro.es>
The `instance` label is only useful if
- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
other nodes
In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on. It is
completely irrelevant which node the exporter runs on; the data
provided doesn't change. The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).
(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)
Note that keeping that instance label on this particular dashboard would
enable the user to switch between a previously failed mgr instance and
the data collected from there and the currently running mgr instance (on
which the Prometheus mgr module runs). That'd split the data, which
I don't think is a useful feature, but rather looks broken.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.
This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Remove hard-coded timezone off Grafana dashboards to enable the Grafana
administrator to decide which timezone should be used for dashboards.
If we hard-coded those values, changing the global settings in Grafana
wouldn't have an effect. And the administrators can't change the
automatically imported Grafana dashboards provided by us.
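A minimal sketch of the kind of change involved, assuming the setting is the top-level `timezone` key of the Grafana dashboard JSON and that dropping the key is enough (the commit may instead have changed its value); `drop_timezone` is a hypothetical helper, not part of the repo:

```python
import json


def drop_timezone(dashboard_json):
    """Remove a hard-coded top-level 'timezone' key from a dashboard, if any."""
    dashboard = json.loads(dashboard_json)
    # Assumption: without this key, Grafana falls back to its own settings.
    dashboard.pop("timezone", None)
    return json.dumps(dashboard, indent=2)


# Hypothetical, heavily trimmed dashboard.
print(drop_timezone('{"title": "Ceph - Cluster", "timezone": "browser"}'))
```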
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Convert newline character from CRLF in `rbd-details.json` to LF, so that
it will be consistent with all the other dashboard JSON files.
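The conversion itself is a plain byte replacement. A minimal Python sketch (the commit may well have used another tool such as dos2unix; this is only an illustration):

```python
def crlf_to_lf(data):
    """Replace every CRLF line ending in a bytes object with LF."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical file content with CRLF line endings.
print(crlf_to_lf(b'{\r\n  "title": "RBD Details"\r\n}\r\n'))
```

Working on bytes rather than text avoids any implicit newline translation by the Python I/O layer.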
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Work around https://github.com/containers/buildah/issues/3253
by pushing to a local OCI-format image to clear out erroneously-left
'parent' field in buildah commit --squash output. Can be removed
when the fix for the above is available.
Signed-off-by: Dan Mick <dmick@redhat.com>
podman login caches auth tokens in auth.json; for sudo, it may be
placed in /run/containers/0 or it may be in /run/users/0/containers;
the latter directory is removed when root "logs out", and it isn't
clear what that means with sudo/su. Several builds failed because
they couldn't find the cached auth between sudo podman login and sudo
podman push. Sidestep the confusion by just using a local file for
the auth cache.
Signed-off-by: Dan Mick <dmick@redhat.com>
- allow env setting of versions of components
- add docker/quay username/password variables
- derive container version from grafana version
- make arch-specific tags
- expand clean target to remove container images
- remove release-specific targets, "all" target
- move push operations to separate "push" target
Signed-off-by: Dan Mick <dmick@redhat.com>
Some build machines don't have /etc/sub{u,g}id set up for
so-called "rootless" (non-privileged) operation. Use sudo
to avoid the need for "rootless".
Signed-off-by: Dan Mick <dmick@redhat.com>
This is a replacement dashboard configuration for the
pool overview page. It provides a cluster wide view of
capacity consumed and compression effectiveness, and
breaks this down by each pool within the configuration.
Fixes: https://tracker.ceph.com/issues/50226
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
mgr/dashboard: Fixed name clash when hostname similar to another
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
The hosts-overview Grafana dashboard json file contains a repeated element, making
it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet
from parsing the dashboard, which prevents the deployment of this dashboard via
Jsonnet.
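Repeated keys like this are easy to miss because many parsers silently keep the last value. A minimal Python sketch for detecting them, using the `object_pairs_hook` parameter of the standard `json` module (`find_duplicate_keys` and the sample fragment are hypothetical, not taken from the dashboard):

```python
import json


def find_duplicate_keys(text):
    """Return the keys that occur more than once within the same JSON object."""
    duplicates = []

    def check_pairs(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        # Build the dict as usual (last value wins) so parsing continues.
        return dict(pairs)

    json.loads(text, object_pairs_hook=check_pairs)
    return duplicates


# Hypothetical fragment with a repeated "title" key.
sample = '{"title": "a", "panels": [], "title": "b"}'
print(find_duplicate_keys(sample))
```

The same key appearing in different (nested) objects is not reported, only repeats within one object.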
Fixes: https://tracker.ceph.com/issues/50410
Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>
run-promtool-unittests is failing because of differences in floating-point values in some complex calculations. This PR intends to simplify those calculations and fix this issue.
Fixes: https://tracker.ceph.com/issues/49952
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to add unit testing for prometheus rules using promtool. To run the tests, run the 'run-promtool-unittests.sh' file.
Fixes: https://tracker.ceph.com/issues/45415
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
mgr/dashboard: prometheus alerting: add some leeway for packet drops and errors
Reviewed-by: Stephan Müller <smueller@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
This PR intends to alert a user if a specific network is configured with a custom MTU
Fixes: https://tracker.ceph.com/issues/48748
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
SLOW_OPS is triggered by the op tracker and generates a health
alert, but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds a SLOW_OPS metric, and
provides a simple means to extend to other relevant health
checks in the future.
If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition, the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.
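The extract-or-remove behaviour can be sketched as follows. This is not the module's code: the message format ("N slow ops, ..."), the `metrics` dict and the function name are assumptions for illustration.

```python
import logging
import re

log = logging.getLogger(__name__)

# Assumed summary format, e.g. "3 slow ops, oldest one blocked for 35 sec".
SLOW_OPS_RE = re.compile(r"^(\d+) slow ops")


def update_slow_ops(metrics, message):
    """Set metrics['SLOW_OPS'] from the message, or drop it if parsing fails."""
    match = SLOW_OPS_RE.match(message)
    if match is None:
        # Extraction failed: log an error and remove the metric from the set.
        log.error("unable to extract SLOW_OPS value from: %s", message)
        metrics.pop("SLOW_OPS", None)
    else:
        metrics["SLOW_OPS"] = int(match.group(1))
    return metrics


print(update_slow_ops({}, "3 slow ops, oldest one blocked for 35 sec"))
```

Removing the metric on failure, rather than leaving a stale value behind, keeps alerts from firing on data that no longer reflects the health check.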
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The labels on one side do not match the labels of the other side, where
a label_replace is used. The fix uses the same label_replace on the
missing side.
Fixes: https://tracker.ceph.com/issues/47334
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
This commit provides the Makefile to create the
ceph-grafana containers for nautilus, octopus and
master releases.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
- Exchange read/write legends in the `I/O Bytes per second` panel.
- Rename `I/O Bytes per second` to `Throughput`.
- Rename `IOPS Count` to just `IOPS`.
- Remove instance name from legends.
- Fix typo: `Averange` -> `Average`.
Fixes: https://tracker.ceph.com/issues/45735
Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>
* RGW sync perf. counters are now exposed through grafana panels.
* Sync Performance tab is only shown if rgw realm is detected.
* Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility).
Fixes: https://tracker.ceph.com/issues/45310
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
The alert was triggered when less than 90% of OSDs were _up_, but then the
description took that value and described it as the percentage of OSDs being
_down_. So with 12% of OSDs down, the alert description would read:
```
88% or 88 of 100 OSDs are down (>=10%).
```
which can be panic-inducing.
This commit changes the alert expression to actually compute the ratio of OSDs
being down, which makes the correct value appear in the description.
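The arithmetic behind the corrected expression, as a minimal Python sketch (the real change is in the PromQL alert expression; the function name here is hypothetical):

```python
def osd_down_percent(total_osds, up_osds):
    """Percentage of OSDs that are down (not the percentage that are up)."""
    return 100 * (total_osds - up_osds) / total_osds


total, up = 100, 88
down = osd_down_percent(total, up)
# The alert fires at >= 10% down; the description now uses the same value.
print(f"{down:.0f}% or {total - up} of {total} OSDs are down (>=10%).")
```

For the example above this reports 12% down instead of the misleading 88%.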
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Set decimal precision to 2 positions for charts using percentunits.
Fixes: https://tracker.ceph.com/issues/45183
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
The $ceph_hosts variable contained the FQDN for hosts
while the instance label created by ceph only has
the hostname.
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
In the AVG Disk Utilization panel, the result is calculated
by combining the output of node_disk_io_time_seconds_total
with the output of ceph_disk_occupation. However, the
first vector encodes the instance label with the full FQDN
while the ceph label only contains the hostname:port. In
order for these to match correctly, the domain name and port
have to be stripped from the labels.
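The normalisation can be illustrated in Python. In the dashboard this would be done inside the PromQL query rather than in Python; the host names below are hypothetical, and the sketch assumes host names rather than IP addresses.

```python
import re


def short_hostname(instance):
    """Reduce 'host.example.com:9100' or 'host:9283' to plain 'host'."""
    # Drop everything from the first '.' or ':' onwards.
    return re.sub(r"[.:].*$", "", instance)


# Both sides of the join now carry the same instance value.
print(short_hostname("node1.example.com:9100"))
print(short_hostname("node1:9283"))
```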
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
When moving to LVM-based ceph-volume setups, several
grafana dashboards stopped working. The problem is that
(device, instance) no longer results in unique labels
which causes errors like:
"many-to-many matching not allowed: matching labels must be unique on one side"
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
The references to `$osd_hosts` etc. were encoded as
`[[osd_hosts]]` in the PromQL expression divisor, and
the panel always displayed N/A as the result of the
query.
Replacing the `[[...]]` with `$...` makes the expression
work again.
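The rewrite is mechanical. A minimal Python sketch of the substitution, using a hypothetical expression (the actual panel query is not reproduced here):

```python
import re


def modernize_variables(expr):
    """Rewrite Grafana template variables from [[name]] to $name."""
    return re.sub(r"\[\[(\w+)\]\]", r"$\1", expr)


# Hypothetical PromQL fragment using the old variable syntax.
old = 'count(ceph_osd_up{instance=~"[[osd_hosts]]"})'
print(modernize_variables(old))
```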
Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
Fixes: https://tracker.ceph.com/issues/44623
Signed-off-by: Enno Gotthold <egotthold@suse.de>
By default, this dashboard will be empty, like the already existing
dashboard with the summary for all rbd images.
Prefer the non-British spelling of utilization since that's what the majority
of the code base seems to use.
Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>
mgr/dashboard: grafana charts match time picker selection.
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Reviewed-by: Patrick Seidensal <pnawracay@suse.com>
Remove shortcut menu (links) and add check in grafana CI script.
Fixes: https://tracker.ceph.com/issues/43091
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
Fixes: https://tracker.ceph.com/issues/42542
Sort order was wrong for some dashboards.
- Fixed empty / buggy Top 3 clients IOPS by pool / Throughput in Pools
  Overall performance
- Fixed Avg utilization "Multiple series found" in Host Overall
  performance
- Fixed "invalid dimensions for plot" in OSD Overall performance
Signed-off-by: Radu Toader <radu.m.toader@gmail.com>
Fix various panels that used outdated metric names, clunky or
unnecessary label_replace calls. Also unify the style of many panels.
Fixes: http://tracker.ceph.com/issues/39652
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
We are currently hosting the grafana dashboards in our repo but we do
not install them. This patch adds the cmake support.
Signed-off-by: Boris Ranto <branto@redhat.com>
Linewidth was set to two, but the idea is that
a linewidth of >1 is reserved for eye-catcher
plot lines like maximums
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
These new dashboard definitions provide the high
level views for the hosts in the cluster and the
OSDs.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The host-details.json file provides a view of host
level metrics. The panels are arranged in two
rows:
Overview: CPU/RAM/Network related stats
OSD Performance: OSD physical drive stats
The overview row is shown by default. Click on
the OSD Performance row to show the remaining
graphs.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>